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1.  Introduction 


Maximum  likelihood  (ML)  estimation  is  a  popular  approach  in  solving  signal  processing 
problems,  especially  in  scenarios  with  a  large  data  set,  where  the  maximum  likelihood 
estimator  (MLE)  is  in  many  senses  optimal  due  to  its  asymptotic  characteristics.  The 
procedure  simply  relies  on  maximizing  the  likelihood  equation,  and,  in  analytically 
intractable  cases,  the  MLE  can  still  be  obtained  iteratively  through  methods  of 
optimization,  e.g.,  via  the  method  of  scoring.  However,  in  many  signal  processing 
problems,  it  is  desirable  or  necessary  to  perform  ML  estimation  when  side  information  is 
available.  Often  this  additional  information  is  in  the  form  of  parametric  equality  or 
inequality  constraints  on  a  subset  of  the  parameters.  Examples  of  side  information  include 
the  constant  modulus  property,  some  known  signal  values  (semi-blind  problems),  restricted 
power  levels  (e.g.,  in  networks),  known  angles  of  arrival,  array  calibration,  some  forms  of 
precoding,  and  so  on.  With  the  addition  of  these  parametric  constraints,  this  procedure  is 
now  called  constrained  maximum  likelihood  (CML)  estimation,  with  the  solution  being  the 
constrained  maximum  likelihood  estimator  (CMLE). 

As  a  measure  of  performance  for  the  MLE,  the  Cramer- Ran  bound  (CRB),  obtained  via 
the  inverse  of  the  Fisher  information  matrix  (FIM),  is  the  lower  bound  of  the  error 
covariance  of  any  unbiased  estimator.  However,  it  is  desirable  to  measure  performance  of 
estimators  that  satisfy  the  side  information  constraints.  Gorman  and  Hero  used  the 
Chapman- Robbins  bound  to  develop  a  constrained  version  of  the  CRB  which  lower  bounds 
the  error  covariance  for  constrained,  unbiased  estimators  for  the  case  when  the 
unconstrained  model  has  a  nonsingular  FIM  (1).  Marzetta  simplified  their  derivation  and 
formulation  for  the  same  nonsingular  FIM  case  (£).  Then,  Stoica  and  Ng  constructed  a 
more  general  formulation  of  the  CRB  that  incorporates  the  constraint  information  without 
the  assumption  of  a  full-rank  FIM  ( 3 ).  Their  constrained  CRB  (CCRB)  was  also  shown  to 
subsume  the  previous  cases  which  require  a  nonsingular  FIM. 

The  MLE  is  an  optimal  choice  for  an  estimator  in  the  sense  that  asymptotically  the  MLE 
is  both  consistent  and  efficient  (4).  For  the  case  of  linear  equality  constraints,  Osborne 
showed  that  this  result  is  preserved  with  the  error  covariance  of  the  CMLE  approaching 
the  CCRB  (5).  Osborne’s  result  was  obtained  independently  and  is  further  confirmation  of 
the  CCRB  result  in  (5),  although  the  CCRB  is  not  discussed  in  (5).  We  extend  this 
asymptotic  normality  result  for  the  CMLE  to  the  more  general  nonlinear  constraint  case. 
Specifically,  we  show  that  the  CMLE  is  also  consistent  and  asymptotically  efficient  with 
respect  to  the  CCRB. 

Although  the  ML  problem  is  easy  to  express,  obtaining  the  MLE  is  often  a  difficult  task. 
Fortunately,  iterative  techniques,  such  as  the  method  of  scoring  (4,6),  are  available  which 
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reach  the  MLE  under  certain  conditions.  Typically,  scoring  is  applied  on  top  of  an  existing 
method  which  provides  the  intialization  (7,8).  However,  those  schemes  must  be  adjusted 
when  constraints  have  been  added  to  the  model.  Prior  research  has  focused  on  developing 
iterative  techniques  based  on  Lagrangian  methods  to  obtain  the  CMLE  (5,9,10).  However, 
convergence  properties  and  criteria  are  overlooked. 

The  constrained  scoring  algorithm  (CSA)  developed  here  is  a  generalization  of  the  method 
of  scoring  to  obtain  the  CMLE  under  certain  conditions.  The  scheme  relies  on  alternating 
between  a  projection  step,  similar  to  gradient  based  descent  iterations,  and  a  restorative 
step  which  ensures  that  the  solution  satisfies  the  parametric  constraints.  We,  furthermore, 
detail  several  convergence  properties  associated  with  this  CSA.  Convergence  of  iterative 
techniques  are  always  dependent  on  the  initialization,  but  the  results  obtained  here  show 
that  this  method  obtains  at  least  a  local  MLE.  Thus,  with  a  sufficiently  accurate 
initialzation,  the  CSA  will  in  fact  obtain  the  CMLE. 

We  provide  several  examples  to  demonstrate  the  effectiveness  of  the  CSA.  First,  we 
examine  the  classical  CMLE  problem  when  imposing  linear  constraints  on  a  linear  model. 
The  CSA  analytically  solves  this  problem  in  a  single  step  and  provides  an  equivalent 
alternative  to  the  traditional  answer  (e.g.,  see  (4,  p.  252)  and  (11,  p.  299)),  where  the  new 
solution  is  applicable  under  weaker  conditions.  We  also  demonstrate  that  our  CMLE  and 
the  traditional  solution  are  both  unbiased  and  efficient.  Second,  we  find  the  CMLE  after 
imposing  nonlinear,  nonconvex  constraints  on  the  signal  modulus  in  a  complex-valued 
linear  model.  Provided  the  initialization  is  sufficiently  close,  our  simulations  show  evidence 
of  unbiasedness  and  efficiency  for  this  particular  choice  of  constraint  as  well.  Third,  we 
consider  a  nonlinear  model  parameterization  from  (12).  In  this  case,  contant  modulus  and 
semiblind  constraints  are  applied  on  the  signal  that  passes  through  an  instantaneous 
mixing  channel. 

This  paper  is  developed  as  follows:  In  the  following  section,  we  formally  state  the  CML 
problem  and  give  definitions  for  necessary  terms  used  throughout.  In  section  3,  we 
determine  the  asymptotic  normality  properties  of  the  CMLE,  showing  both  consistency 
and  asymptotic  efficiency.  Next,  in  section  4,  we  develop  the  CSA  via  the  method  of 
Lagrange  multipliers  and  natural  projections.  In  section  5,  we  discuss  the  convergence 
properties  of  the  given  CSA.  In  sections  6  and  7,  we  provide  some  examples  that  illustrate 
the  effectiveness  of  the  CSA. 


2.  Preliminaries 

In  this  and  subsequent  sections,  we  use  the  following  notation:  Scalars  will  be  in  lowercase, 
vectors  in  bold  font,  and  matrices  in  uppercase  bold  font  (e.g.,  a  is  a  scalar,  a  a  vector,  and 
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A  a  matrix),  where  it  is  understood  that  (-)j  will  denote  the  ith  element  of  a  vector  or 
row/column  of  a  matrix  and  (-)y  will  denote  the  ith  row,  jth  column  element  of  a  matrix. 
We  will  denote  (-)T,  (•)*,  and  (-)H  as  the  transpose,  the  conjugate  and  the  conjugate 
transpose,  respectively,  of  either  a  vector  or  matrix.  For  matrices,  (-)-1  is  the  matrix 
inverse  and  (-p  the  pseudoinverse.  All  vectors  will  be  column  vectors.  Sets  will  be  denoted 
in  capital  Greek  letters,  and  all  sets  will  be  a  subset  of  a  Euclidean  metric  space,  i.e., 

(Mn,  || -||)  for  some  positive  integer  n  and  where  ||-||  is  the  L2-norm.  When  applied  to 
matrices,  ||-||  will  be  the  Frobenius  norm. 

2.1  Problem  Statement  and  Definitions 


We  have  a  vector  of  observations  x  in  a  sample  space  12  C  MM  satisfying  the  likelihood 
function  of  a  known  form  p(x\  9).  We  want  to  estimate  the  unknown  9  parameter  vector 
under  the  assumption  that  9  is  restricted  to  a  closed,  convex  set  ©  C  M.N ,  which  we  will 
assume  can  be  defined  parametrically.1  Let  0o  e  ©  be  the  true  vector  of  parameters.  Then 
the  CMLE  9(x)  is  given  by 

9{x)  =  argma xp(x\9).  (1) 


Since  —  log(-)  is  strictly  monotone  decreasing,  this  CMLE  can  alternately  be  viewed  as  the 
solution  to  the  following  constrained  optimization  problem 


nun  —log  p(x;9) 

(2) 

s.t.  f{9)=  0 

(3) 

9(0)  <  0 

(4) 

where  the  the  negative  log-likelihood  function  is  the  objective  function,  and  /  :  lw  — >  MA 
and  g  :  lw  — ■>  are  the  functional  constraints  which  define  the  constraint  set,  i.e., 

©  =  {0  :  f{9)  =  0,  g(9)  <  0}.  We  make  the  assumption  that  p(x ;  9),  f(9),  and  g(9)  all 
have  continuous  second  derivatives  with  respect  to  9.  Derivatives  will  be  expressed  either 
as  Ve(-)  or  as  We  also  require  that  the  functional  constraints  be  consistent,  i.e., 

©  7^  0.  All  expectations  will  be  with  respect  to  the  appropriate  distribution  of  the 
likelihood,  i.e.  Ee(-)  =  fyeti(-)p(y,9)dy. 

A  point  9  which  satisfies  the  functional  constraints  is  said  to  be  feasible ,  and  ©  is  referred 
to  as  the  feasible  region.  The  ith  inequality  constraint  gi(9)  <  0  is  said  to  be  active  at  a 
feasible  point  9  if  gi{9)  =  0,  otherwise  it  is  inactive.  Thus,  the  equality  constraints  f(9) 
are  always  considered  active.  A  feasible  point  9  is  a  regular  point  of  the  constraints  in  f 
and  the  active  constraints  in  g  if  the  vectors  Vgfi(9),  Vggj(9),  l<i<K,l<j<  L',  are 

1The  condition  that  ©  be  convex  is  not  so  much  a  strict  requirement  as  it  is  a  convenient  one.  More 
discussion  on  this  issue  is  found  in  section  4.3. 
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linearly  independent,  where  we  assume  only  the  first  L'  constraints  of  g  are  active.  Thus,  a 
regular  point  requires  no  redundancy  in  the  active  constraints.  Properties  for  these  terms 
can  be  found  in  (13,14). 

Define  the  gradient  matrices  of  f  and  g  by  the  continuous  functions 

F(S)  =  and  G(0)  =  Vjg(fl). 


Note  that,  assuming  6  is  regular  in  the  active  constraint  set,  F(9)  has  full  row  rank  K, 
whereas  G(0)  is  not  necessarily  so.  Define  U  :  lw  — >  MArxJ  to  be  a  continuous  function 
such  that  for  each  9,  U(6)  is  a  matrix  whose  columns  form  an  orthonormal  null  space  of 
the  range  space  of  the  row  vectors  in  F(Q ),  i.e.,  such  that 

F(0)U(0)  =  0  and  UT(6)U(0)  =  IJxJ  (5) 

for  every  9  E  WN .  If  9  is  regular  then  U  :  WN  — >•  M.NxN~K ,  i.e.,  then  J  —  N  —  K  since  F(0) 
is  full  row  rank.  Also,  note  that  U  is  independent  of  9  whenever  F  is.  This  occurs  in  the 
linearly  constrained  case,  when  f(0)  =  F6  +  v  for  some  matrix  F  E  M,NxN  and  vector 
v  E  M,v.  So,  the  gradient  F(0)  is  a  constant  F  and,  thus,  U(0)  =  U . 

2.2  The  Constrained  Cramer-Rao  Lower  Bound 

The  error  covariance  of  any  unbiased  estimator  9(x)  of  9  is  bounded  by  the  Cramer-Rao 
Lower  Bound  (CRB).  The  classical  development  of  the  Cramer-Rao  Lower  Bound  (CRB)  is 
well-known  (4,15).  The  bound  is  expressed  as 

Ee((9(x)  -  9)(9(x)  -  0)T)  >  I~\0)  (6) 


where  I (9)  is  the  FIM  given  by 


1(9)  =  Ee 


<9  log p(x;  9)  <9  log p(x;  9) 
89  8(F 


(7) 


The  CRB  and  FIM  exist  provided  the  following  regularity  conditions2  hold: 


E, 


<9  log p(x;  9) 
'  89 


=  0  and  EEe6(X)  =  EeS{x)E^l. 


(8) 


2In  the  statistical  inference  literature  a  point  0  that  satisfies  these  regularity  conditions  is  sometimes 
said  to  be  regular,  or  information-regular  (16).  To  avoid  the  confusion  between  that  definition  and  the 
optimization  literature’s  definition  in  the  prior  subsection,  all  instances  of  regular  in  this  paper  will  be  in  the 
optimization  context  and  all  instances  of  regularity  will  be  in  the  statistical  inference  context. 
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Note  these  must  hold  for  all  0.  Additionally,  if  we  also  have  that 

^  <92log p(x;0)  d  ^  9  log p(x;0) 

6  d6doT  =  ddLe  ’ 


then  we  can  use  an  alternate  expression  for  the  FfM: 

d 2  log p(x;  0) 


1(0)  =  -Ee 


dOdOT 


(9) 


(10) 


From  equation  6,  the  existence  of  the  CRB  requires  a  nonsingular  FIM  as  well.  When  the 
model  is  not  locally  identifiable,  however,  then  the  FfM  is  singular  and  the  CRB  is  not 
always  defined.3  To  obtain  a  CRB,  the  model  must  include  sufficient  constraints  on  the 
parameters  to  achieve  identifiability  and,  hence,  a  nonsingular  FIM.  The  choice  of 
constraints  are  completely  dictated  by  the  model,  and,  thus,  applying  such  constraints 
previously  required  a  reevaluation  of  the  FIM  on  some  appropriate  dimension-reducing 
reparameterization  of  0  that  includes  this  constraint  information.  A  bound  is  then 
obtained  by  a  transformation  from  the  reparameterization  back  to  0.  Since  the  reevaluated 
FIM  depends  on  the  reparameterization,  i.e.,  it  needs  to  be  evaluated  for  every 
reparameterization  or  unique  set  of  constraints,  the  classical  Fisher  Information  theory 
does  not  incorporate  the  constraint  information  in  any  convenient  closed  form.  Even  when 
the  original  model  is  identifiable,  the  classical  theory  ignores  the  contribution  of  the  side 
information  in  the  form  of  a  specified  constraint  set. 

To  overcome  this  deficiency,  Gorman  and  Hero  ( 1 )  and  then  Marzetta  (2)  developed 
formulations  of  the  CRB  which  do  include  constraint  information  for  the  case  where  the 
FIM  is  full-rank.  Improving  on  their  work,  Stoica  and  Ng  (3)  formulated  a  CCRB  that 
explicitly  incorporates  the  active  constraint  information  with  the  original  FIM,  singular  or 
nonsingular. 

Theorem  1  (Stoica  Ng  (<?)).  Assume  we  know  the  active  constraints  and  that  these 
are  all  incorporated  into  f .  Let  0(x)  be  an  unbiased  estimate  of  0  satisfying  the  active 
functional  constraints  in  equation  3.  Then,  under  certain  regularity  conditions,  if 
UT(0)I(0)U(0)  is  nonsingular, 

Ee((0(x)  -  0)(0(x)  -  0)T)  >  U(0)(UT(0)I(0)U(e))-1UT(0)  =  B(0)  (11) 

where  equality  is  achieved  if  and  only  if  (in  the  mean  square  sense) 

0(x)-0  =  U(0)(UT(0)I(0)U(0))~lUT(0)W\ogp(x-0). 


In  the  evaluation  of  this  CCRB,  the  inactive  inequality  constraints  do  not  contribute  any 
side  information.  This  is  so  because  we  made  the  assumption  that  the  inequality 

3Often,  the  pseudoinverse  E  (6)  is  used,  but  this  bound  is  not  always  valid  except  under  certain  conditions 
(17). 
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constraints  are  inactive  (1),  i.e.,  only  active  constraints  affect  the  outcome  of  the 
estimator.  Also  note  that  rather  than  requiring  a  non-singular  FIM  1(0),  the  alternate 
condition  is  that  U1  ( O)I(0)U(0 )  be  non-singular.  Thus,  the  unconstrained  FIM  may  still 
be  singular,  or,  equivalently,  the  unconstrained  model  unidentifiable,  but  the  constrained 
model  must  be  identifiable,  at  least  locally. 

The  natural  extensions  to  the  classical  model,  e.g.,  the  CRB  on  differentiable 
transformations  (4,  p.  45)  and  the  CRB  for  biased  bounds,  also  can  be  applied  to  this 
constrained  formulation  (12). 


3.  Asymptotic  Normality  of  the  CMLE 

Asymptotic  properties  of  the  MLE  can  be  found  in  (4).  This  motivates  the  desire  to  obtain 
corresponding  results  for  the  CMLE.  In  particular,  we  wish  to  show  results  on  asymptotic 
consistency  and  efficiency  for  the  constrained  case.  For  asymptotic  consistency,  we  will  rely 
on  the  Kullback-Leibler  information  (4).  For  asymptotic  efficiency,  since  the  constrained 
maximum  likelihood  problem  equation  1  is  equivalent  to  the  constrained  optimization 
problem  equations  2  through  4,  we  will  use  the  tools  of  optimization  theory.  The  main 
results  of  this  section,  equations  18  and  32,  can  be  summarized  as  follows. 

Theorem  2.  Assuming  the  pdf  p(x;0)  satisfies  certain  regularity  conditions,  e.g.,  as  in 
Theorem  1,  then  the  CMLE  6n  is  asymptotically  distributed  according  to 

6n~M(0o,B(9o))  (12) 

where  0O  is  the  true  parameter  vector. 


Proof.  Suppose  we  observe  the  data  set  (aq, . . . ,  xn)  where  each  ay  is  independently 
distributed  as  x ,  i.e.,  we  observe  n  iid  samples  from  a  distribution  of  the  known  form 
p( x,  0),  in  order  to  estimate  6.  And,  again,  0O  is  the  true  parameter  vector.  For  this 
section,  denote  0n(aq, . . .  ,xn)  as  the  CMLE  based  on  n  samples.  For  convenience,  in  this 
section,  the  CMLE  will  also  often  be  denoted  0n.  Then,  the  CMLE  on  n  samples 
maximizes  the  joint  density  of  (aq, . . .  ,xn),  i.e., 

n 

0n(x i,  •  •  • ,  xn)  =  arg max  TT p( aq,  G).  (13) 

0e©  -LJ- 

i=l 


Or,  equivalently,  since  log  is  monotone, 


Gn  (*®1 ,  •  •  •  ,  Xn  ) 


1 

arg  max  — 

0e©  n 


log  Y[p(xi,0) 
1=1 


1 

arg  max  — 

0e©  n 


n 

J2\ogp(xi,0). 
i=  1 


(14) 

(15) 
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As  n  — >  oo,  by  the  law  of  large  numbers  we  have  that 

On(xii  •  ■  •  i  *n)  -»■  arg  max  log p(x\ 9) 

ee& 


(16) 


The  Kullback-Liebler  information  satisfies 


Ee  log 


p(x-,e) 


p{x-4>) 


(17) 


with  equality  if  and  only  if  9  = 
with  equality  if  and  only  if  9  = 
is  9a.  That  is,  as  n  — >•  oo  then 


< p .  Thus,  we  have  that  E0o  logp(x;  9)  <  Ego  log p(x;  9a ) 

0o.  Thus,  the  solution  of  the  maximization  in  equation  16 

0n(aq,..,,ccn)  ->  0o,  (18) 


and  the  CMLE  is  consistent. 

Next  we  determine  the  asymptotic  covariance  characteristics  of  6n  employing  tools  from 
optimization  theory.  One  such  useful  tool  in  converting  the  constrained  optimization 
problem  equation  2  into  an  unconstrained  problem  is  the  method  of  Lagrange  multipliers. 
The  method  develops  a  Lagrangian  function  which  incorporates  the  objective  and 
constraints  so  that  solutions  of  the  optimization  problem  must  be  stationary  points  of  this 
function.4  The  Lagrangian  of  equation  2  is 

C[Q,n,v)  =  -logp(x;G)  +  nTf{0)  +  uTg{d).  (19) 


The  vectors  pi  G  MA  and  v  G  ML  are  the  Lagrange  multipliers  of  the  function.  Any  potential 
solution  of  equation  2  must  be  a  stationary  point  of  equation  19,  i.e.,  it  must  be  a  point  9* 
satisfying  the  following  Karush-Kuhn- Tucker  (KKT)  necessary  conditions  (18,  p.243): 


vj£(r,  *.',!/•)  =  /(»*)  =  0 

9(9*)  <  o 

v'  >  0 

v‘Tg(0‘)  =  0 

VJ£(e•,M*,^*)  =  -Vpogp(a:;e•)+^^•'^^’(e•)  +  ^/•TG(e•)  =  0. 


(20) 

(21) 

(22) 

(23) 

(24) 


Note  the  first  two  conditions,  equations  20  and  21,  are  simply  the  constraints  equations  3 
and  4  from  the  constrained  optimization  problem.  Also,  equation  23  implies  that  v*  is 
nonzero  if  and  only  if  the  constraint  gt  is  active.  Either  u*  or  gi(6*)  is  zero,  exclusively,  so 

4 A  commonly  used  alternative  optimality  condition  for  convex  sets  is  discussed  in  appendix  A. 
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equations  21  through  23  actually  define  L  explicit  equations.  Since  equation  20  defines  K 
equations  and  equation  24  defines  N  equations,  this  system  above  contains  exactly 
K  +  L  +  N  equations  with  K  +  L  +  N  unknowns  ( 9 *,  Where  these  equations  are 

not  redundant,  i.e.,  at  a  regular  point,  the  solution  (0*,/j,*,v*)  is  locally  unique.  This 
solution  (9*,  n*,  u*),  however,  is  typically  not  analytically  tractable  and  requires  a 
numerical  optimization  approach. 

The  last  equation  can  be  conveniently  simplified  by  considering  only  the  active  constraints 
at  a  stationary  point  9*,  for  similar  reasons  given  in  the  CCRB  development.  So  for  any 
stationary  point,  we  will  assume  that  all  the  active  constraints  are  already  incorporated 
into  /  and  modify  F  and  U  accordingly.  Hence,  we  can  rewrite  equation  24  as 

— Vg  logp(x;  9*)  +  fi*TF(9*)  =  0.  (25) 


This  implies  that  the  gradient  of  the  log-likelihood  is  in  the  range  space  of  the  gradient  of 
the  active  equality  constraints  at  the  stationary  point.  Hence,  geometrically,  the  direction 
of  steepest  descent  of  the  objective  function  must  be  orthogonal  to  the  tangent  plane  of  f 
at  stationary  points  of  the  Lagrangian.  And  since  9n  is  a  stationary  point,  equations  20 
through  25  hold  at  9n.  From  equation  5,  F(9n)U(9n )  =  0,  so  equation  25  implies  that  a 
necessary  condition  for  9n  to  be  the  CMLE  is  for 

WTe  log p(x\  9n)U{9n)  =  0.  (26) 


For  a  moment,  it  will  be  convenient  to  consider  the  the  equivalent  condition  of  equation  26 
in  vector  notation,  i.e.,  for  1  <  j  <  J, 

Vg  log p(x]  9n)uj(9n)  =  0  (27) 

where  Uj{9 )  is  the  jth  column  of  U(9).  Now  let  9  be  a  point  near  9n.  The  Taylor 
expansion  (19)  of  equation  27  about  9  evaluated  at  9n  gives  us 


0  =  Vg  \ogp(x]9)uj(9)  +uj(9)\/2g\ogp(x-,9)(9n  -  9) 

+Vq  logp(£c;  9)V1euj(9)(9n  -  9)  +  o(l) 

=  Ve  log p(x;  9)[uj(9)  +  V1euj(9)(9n  -  9)  +  o(l)] 
+uJ(9)V29  log p(x;  9)(9n  -9)  +  o(l) 

=  V9logp(x-,9)uj(9n)  +  uj(9)V2e\ogp(x;9)(9n  -  9)  +  o(l) 
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(28) 


where  the  o(l)  term  is  the  sum  of  the  higher  order  terms  of  (Gn  —  0),  which  vanishes  as 
6n  —  0  — >  0.  Collecting  the  vectors  in  equation  28  in  matrix  notation,  we  have 

0  =  UT(0n)Ve  log P(x-  6 )  +  UT(O)V20  log p(x;  0){0n  -  0)  +  o(l).5  (29) 

Since  the  CMLE  is  consistent  from  equation  18,  then  for  sufficiently  large  n,  the  CMLE  is 
close  to  the  true  parameter  vector  0o,  and  so  the  equation  is  satisfied  for  the  true 
parameter  vector  0O,  i.e.,  when  0  =  0O.  So, 

0  =  UT (0n)V e  log p(x\ 60)  +  UT  (Go)V20\ogp(x-,Go)(Gn  -  0O)  +  o(l).  (30) 

Recall  that  Ga  is  also  subject  to  the  active  (equality)  constraints,  i.e.,  it  is  a  feasible  point 
just  as  0„  is.  Since  ©  is  connected,  there  exists  a  path-connected  curve  on  the  surface  of 
/(0)  =  0  including  the  points  j Gn  :  n  —  1,  2, . . .  j  and  60.  Such  a  curve,  in  the 
optimization  context,  is  called  a  feasible  arc  since  every  point  on  the  curve  satisfies  the 
equality  constraints.  Let  the  continuously  differentiable  map  G  :  M  — >  ©  denote  this 
feasible  arc  such  that  0(0)  =  0O  and  *(£)  =  for  each  n.  This  arc  reflects  the  consistency 
result  since  Gn  =  0(-)  — >  0(0)  =  0O  as  n  — >  oo.  Since  /(0(f))  =  0  for  all  t,  then 

0=4 /(«(*))  =^f(mfe(t)  =F(e(0))fe(t)  =  F(ec)fe(t)  . 

ai  t=0  ai  t=0  az  t=0  cli  t=0 

Thus,  from  equation  5,  40(t)  G  span  U(G0).  Hence,  using  the  Lagrange  remainder 

t= o 

form  for  the  Taylor  series  (19),  we  have  that 

Gn  -0O  =  G(-)  -  0(0)  =  -U(0(s(n)))qn  (31) 

n  n 

for  some  0  <  s(n)  <  4  and  some  qn  G  MJ.  Substituting  equation  31  into  30,  we  have 

0  =  U1  (0n)V0  log p(x;  G0)  -  UT(GO)—V20  log p(x\  G0)U ( 0(s(n)))qn  +  o(l). 

n 

Given  the  regularity  condition  of  Theorem  1  at  0O,  then  for  sufficiently  large  n  we  have 
that  the  matrix  UT(Go)^V0logp(x;  0o)U(0(s(  n) ) )  is  invertible  for  0  in  a  neighborhood  of 
0O.  Therefore,  solving  for  qn  we  hnd  that 

qn  =  (^UT (G0)^X72elogp(x;  0o)U(6(s(n)))^  UT(0n)V0logp(x-,6o)  +  o(l). 

5By  o(l)  we  mean  a  vector  which  has  all  its  elements  o(l). 
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Substituting  qn  back  into  equation  31  yields 


On  ~0O  — 

-U(6(s(n))) 

n 


UT{G0)-V2e\ogp{x-,G0)U{G{s{n))) 

n 


UT(Gn)Velogp(x;  0o)  +  o(l), 


or 

Vn(0n  ~  Go)  = 

U(0(s(n)))  ( UT  (G0)—'Vg\ogp(x;  G0)U(G(s(n)))\  UT(Gn)-^Ve  \ogp(x;  Ga)  +  o(l). 
V  n  j  \Jn 


Now,  let  n  — >  oo.  Then  Gn  — >  G0  by  asymptotic  consistency  from  equation  18,  o(l)  — >  0, 
and  — -V|log p(x;  G0)  — >  I(0O)  by  the  law  of  large  numbers,  i.e.,  since 
^Velog p(x;G0)  ~  A/”(0, 1(G0))  (4).  Also  U(Gn)  — >  U(G0)  by  continuity,  and 
U(0(s(  n )))  — >  U(0o)  by  the  pinching  theorem  and  continuity.  Thus,  we  have  that  as 
n  — >  oo, 

U(0(s(n)))  ^UT(G0)^Vllogp(x;G0)U(G(s(n)))^  UT(Gn ) 

^  U(G0)(Ut(G0)I(G0)U (G0))~1Ut(G0)  a.s. 


From  the  asymptotic  normality  result  on  the  MLE  (^,15),  Cov(-^Velog p(x;Ga))  — »■  J(0O) 
as  n  ^  oo  where  the  convergence  is  convergence  in  distribution.  And  thus,  by  Slutsky’s 
theorem  (15),  we  have  that 

Coveo(M0n  -  Oo))  -  C/(0o)(C/T(0o)/(0o)C/(0o))“1C/T(0o)  =  m)  (32) 


where  the  convergence  is  in  distribution.  Since  this  is  the  CCRB  in  equation  11,  this  shows 
that  the  CMLE  is  asymptotically  efficient.  The  result  in  equation  18,  combined  with 
equation  32,  prove  the  theorem.  □ 


Theorem  2  is  parallel  to  the  classical  asymptotic  normality  property  for  the  classical 
(unconstrained)  maximum  likelihood  estimate  (4  )•  And,  this  serves  as  another  verification 
of  the  CCRB  result  in  equation  11.  In  fact,  in  the  absence  of  constraints,  then 
U(G0 )  =  I nxN  since  /  is  null,  and  provided  the  FIM  is  nonsingular,  we  have  the  classical 
asymptotic  MLE  result  as  well.  The  result  in  equation  12  was  earlier  shown  by  Osborne 
(5),  but  only  for  linear  constraints,  and  no  link  was  made  regarding  the  constrained  CRB. 
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4.  Scoring  with  Constraints 


An  analytic,  closed-form  solution  of  the  MLE  is  sometimes  found  from  the  first  order 
conditions  on  the  log-likelihood  (the  KKT  equations  equations  20  through  24  with  null 
constraint  functions),  i.e.,  by  solving  for  G  in 


d 

dG 


\ogp(x-,d) 


0=0 


0. 


Solutions  are  the  stationary  points,  and  if  a  unique  solution  exists,  it  is  the  MLE.  Similarly, 
a  solution  for  the  CMLE  can  be  found  by  solving  for  the  zero  point  of  the  arc  G(t)  satisfying 


d 

dt 


log  p(x;0(t)) 


t= o 


0 


for  any  feasible  arc  G  :  M  — »  ©  parameterized  by  t.  Again,  if  a  unique  solution  exists,  0(0) 
is  the  CMLE. 

In  general,  however,  analytic  solutions  of  the  MLE  and  CMLE  problem  are  unavailable. 
This  motivates  the  use  of  iterative  procedures  to  attain  the  CMLE.  In  this  section,  we 
derive  a  generalization  of  the  classical  scoring  algorithm  by  incorporating  the  side 
information  contained  in  the  constraints.  This  new  method  will  include  a  projection  step 
and  a  restoration  step  to  ensure  that  each  of  the  iterates  remains  both  feasible  and  usable. 

4.1  The  Projection  Step 

The  simplest  iterative  schemes  are  gradient  methods  of  the  form 

Gft+l  T  CYfe  •  dki  (33) 

where  a^  >  0  and  dk  E  are  suitably  chosen  sequences  of  step  sizes  and  step  directions. 
Before  continuing,  it  is  appropriate  to  define  a  few  terms  relating  to  such  iterative  schemes 
(13, If, 20, 21, 22).  Given  a  feasible  point  G ,  a  feasible  step  consists  of  a  step  direction  d  and 
step  size  a  such  that  G  +  ad  is  also  a  feasible  point.  Thus,  by  suitably  chosen,  we  mean  in 
part  that  the  sequences  a^  and  d^  are  chosen  such  that  0/c  is  feasible  for  every  positive 
integer  k.  However,  not  every  sequence  of  feasible  steps  results  in  a  sequence  of  feasible 
points  which  converges  to  a  (local)  minimum  of  the  objective  function.  Thus,  a  usable  step 
consists  of  a  step  direction  d  and  step  size  a  such  that  the  corresponding  evaluation  of  the 
objective  is  less  than  that  of  the  previous  iterate,  i.e., 

—  logp(ay  G)  >  —  logp(:r;  G  +  ad). 
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So,  a  suitably  chosen  sequence  of  step  sizes  and  step  directions  is  a  sequence  of  feasible  and 
usable  steps. 

Of  particular  importance  in  the  class  of  gradient  methods  is  the  subclass  of  Quasi-Newton 
methods  of  the  form 

0k+i  =  0k-  akDkVe  log p(x;  0k) 

where  Dk  is  a  sequence  of  positive  definite  symmetric  matrices.  In  general,  the  gradient  is 
of  the  objective  function,  which  in  our  case  is  the  negative  log-likelihood  function.  Since 
the  gradient  of  the  objective  function  is  the  direction  of  steepest  ascent,  it’s  negative  is  the 
direction  of  steepest  descent.  The  matrix  Dk  then  projects  this  direction  vector  to  some 
purpose,  dictated  by  the  model.  This  class  of  methods  includes 

1.  the  method  of  steepest  descent,  where  Dk  =  InxN, 

2.  Newton’s  method,  where  Dk  =  (V|log p(x;0k))~1,  as  well  as, 

3.  the  method  of  scoring,  where  Dk  =  — I(0k ). 

The  method  of  steepest  descent  often  leads  to  slow,  linear  convergence  rates.  Newton’s 
method  uses  a  quadratic  approximation  to  converge  quadraticly,  and  thus  faster,  but 
requires  a  second-order  differentiation.  The  method  of  scoring  asymptotically  stabilizes  the 
iteration  statistically  by  exchanging  the  Hessian  of  the  objective  with  it’s  expected  value, 
the  negative  FIM.  As  evident  in  the  equalities  in  equations  17  and  10,  this  also  removes  the 
need  for  computing  a  second-order  derivative,  albeit  in  exchange  with  the  required 
evaluation  of  an  expectation  (integral).  None  of  these  methods,  however,  consider  the 
information  and  restrictions  from  the  constraints. 

To  incorporate  constraints6,  we  return  to  the  necessary  conditions  for  a  stationary  point  of 
the  Lagrangian,  in  particular  equations  25  and  20, 

-Ve  logp(ay  0{x))  +  fiT (x)F(d(x))  =  0  (34) 

f(0(x))  =  0.  (35) 


Again,  if  0(x)  is  regular  then  the  above  equations  completely  determine  the  CMLE; 
however,  the  solution  is  difficult  to  obtain  analytically.  Let  6i  e  ©  be  a  given  point  near 
the  CMLE  0{x)  ;  so  6 1  is  a  sufficiently  close  approximation  of  0{x)  which  also  satisfies  the 

6There  a  numerous  ways  to  incorporate  these  constraints,  e.g.,  using  a  directional  Taylor  approximation 
of  the  likelihood  as  discussed  in  appendix  B,  applying  the  method  of  Newton  elimination,  or  using  a  general 
Taylor  approximation  of  the  Lagrangian  optimality  condition  as  discussed  here. 
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constraints.  Now,  consider  the  Taylor  approximations  of  equations  34  and  35  about  6\ 
evaluated  at  a  nearby  point  0: 


-Ve\ogp(x;9i)  -  Vlhgp(x-,6i)(e  -  8y)  +  FT(6)p,  =  0 

/(»,) +  *■(»,)(« -0,)  -  0. 


(36) 

(37) 


Since  0(x)  is  locally  unique  in  equations  34  and  35,  then  by  dropping  the  higher  order 
terms  and  yet  still  forcing  the  equality  with  0,  9  would  be  a  closer  point  to  9(x)  than  9\  is. 
But  we  still  need  to  solve  for  6  to  obtain  this  better  approximation.  To  add  to  this 
difficulty,  /i  is  also  unknown  since  it  is  an  approximation  of  (x( x)  which  is  also  unknown. 

In  matrix  form,  equations  36  and  37  are 


-V|logp(®;0i) 

CD 

'0  -  0! 

V0  log p(x-,  0 1)  -  FT(6i)ni 

F(0i) 

0 

_M  -  Mi. 

-/(0 1) 

where  we  exchanged  a  term  in  equation  36  via  the  first  order  approximation 
FT(9i)fi  ~  Fr(9)n  for  small  /i  ( 9 )  and  added  —  Ft(9i)hi  to  both  sides,  where  Hi  is  a 
chosen  initialization.  (The  choice  of  Hi  will  be  irrelevant  to  our  end  result.)  The  matrix  is 
commonly  referred  to  as  the  KKT  matrix,  and  the  system  as  a  KKT  system  (18).  By 
approximating  the  negative  Hessian  with  the  FIM,  we  have 


'm) 

FT(9 1)‘ 

'0-0!' 

Velogp(;r;01)-JFr(01)/i1' 

F(9 1) 

0 

M  -  Mi. 

-/(0 1) 

(39) 


So,  if  either  the  KKT  matrix  is  nonsingular,  or  0i  is  regular  (so  that  ^(^i)  has  full  row 
rank)  and  the  FIM  is  nonsingular,  then  one  of  the  coefficient  matrices  on  the  LHS  of 
equations  38  and  39  must  be  nonsingular.  Premultiplying  equation  38  or  39  by  the  inverse 
of  their  respective  coefficient  matrix,  if  it  exists,  and  solving  for  0  and  /i  leads  to  an 
iterative  scheme  where  (0,  fi)  are  the  updates  of  the  given  (0!,  Hi).  Such  a  scheme 
iteratively  estimates  the  desired  parameter  estimate  9(x)  as  well  as  the  accompanying 
Lagrange  multipliers  fi(x)  which  satisfies  equation  34.  Such  a  method  is  presented  in 
(9, 10). 7  We  do  not  desire  such  a  method,  since,  in  addition  to  increasing  the  number  of 
parameters  needed  to  be  estimated,  the  solution  (9(x),  £i(x),v(x))  may  not  be  locally 
unique,  hampering  convergence  stopping  criteria,  even  when  the  solution  of  9(x )  is  unique. 

Note  that  if  the  constraints  /  are  linear  and  the  negative  log-likelihood  is  quadratic,  then 
there  are  no  higher  order  terms  and  equations  40  and  41  (as  well  as  equations  36  and  37) 

7It  is  interesting  to  note  that  these  authors  use  a  formula  ( 9 ,  p.823)  in  their  scheme  that  was  later  found 
by  Marzetta  to  be  a  variation  of  the  CCRB  formula  applicable  when  the  FIM  for  the  unconstrained  model 
is  invertible  (2). 
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do  not  give  approximations,  but  are  exact,  i.e.,  then  0  =  6{x).  Thus,  solving  for  0  given  0i 
can  be  done  in  one  step  in  this  case  (see  section  6.1). 

However,  if  G\  is  not  a  regular  point  then  the  constraints  include  redundancy  and  the 
coefficient  matrices  are  not  invertible  (as  can  be  seen  from  their  Schur  complements). 
Likewise,  if  the  FIM  1(6 1)  is  singular,  then  the  statistical  KKT  matrix  in  equation  39  is 
not  invertible.  So  it  is  desirable  to  find  a  scheme  which  does  not  rely  on  inverting  these 


possibly  singular  matrices.  We  can  rewrite  equation  39  as 

I(91)(9-91)  +  Ft(91)(Pl-Pl1)  =  Vo\ogP(x-Ol)-FT(el)vi  (40) 

F(0O(0-0i)  =  f  (6\ ) .  (41) 

Premultiplying  equation  40  by  UJ  (0i),  we  have 

UT(91)I(91)(9  -  00  =  UT(6l)We\ogp(x,el).  (42) 

Solutions  of  (0  —  00  in  equation  42  are  of  the  form 

0  -  0x  =  u(el)(uT(el)i(el)u(e1))-lu(el)Ve  \ogP(x-  00  +  *7  (43) 


where  rj  G  Null(J(0O).  Since,  again,  0  is  a  better  approximation  of  6(x)  than  01?  this 
alone  motivates  an  iterative  scheme  for  obtaining  the  CMLE  6(x).  However,  if  possible,  we 
would  like  to  incorporate  equation  41  to  eliminate  the  variable  77  and  find  the  unique 
solution  to  the  approximated  system  in  equation  39. 

Define  a  continuous  map  V  :  lw  — >  MNx(N~J'>  such  that,  for  every  0  G  RN,  V (0)  is  a 
matrix  whose  columns  form  an  orthonomal  basis  of  the  range  space  of  the  row  vectors  of 
F(6).  Equivalently,  the  columns  of  V(0)  form  an  orthonormal  basis  of  the  nullspace  of 
U{0)  for  each  0  G  RN .  Thus,  UT(6)V(6)  =  0,  VT(0)V(0)  =  I{N-j)x(n-j)  and, 
consequently,  C/(0)t/i  (0)  +  V(9)V2  (0)  =  Inxn ■  Using  this  identity  of  Inxn  in  equation 
42, 

UT(91)I(61)(UT(e1)U(61)  +  VT(01)V(01))(0  -  91)  =  1 7T(9l)Ve  \ogP(x:  91).  (44) 

By  definition,  V(0)  =  FT (9)11(9)  for  some  map  R  :  lw  — >  ^Kx(N~J) ;  so  using  equation 
41  we  have  that 

Vt(91)(9  -  00  =  Rt(91)F(91)(9  -  00  =  -RT(91)f(91). 

But  0i  G  ©  so  f  (0 1)  —  0.  Thus,  V —  $i)  —  0  us  well,  und.  Gquution  44  is  now 

Ut(91)I(91)U(91)Ut(91)(9  -  0X)  =  C/r(0i)Ve logp(*;  00-  (45) 
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Multiplying  ( U 7  (O,)!  (Oi)U  (0,))  1  to  both  sides  of  equation  45,  we  have 


t/T(01)(0-01)  =  (UT(01)I(01)U(01)y1UT(0l)We\ogp(x-01). 


Thus,  the  additional  information  from  equation  41  eliminates  the  rj  term  in  equation  43, 
and  the  solution  is 

0-0,  =  (U(01)UT(O1)  +  V(01)VT(0l))(0-0l) 

=  t/(01)t/T(01)(0-01) 

=  U (Oi)(UT (0,)I (Oi)U (Oi))~1UT (0i)\7 elogp(x]  0,). 


So  if  G  ©  is  a  close  approximation  of  the  CMLE  0(x)  then  O2  =  0  is  a  closer  one.  This 
prompts  the  following  iterative  scheme,  a  variation  on  the  method  of  scoring 

0k+1  =  Ok  +  U(Ok)(UT(Ok)I(Ok)U(Ok))-lUT(Ok)^e\ogp(x-Ok) 

=  0k  +  B(Ok)V 0  log p(x;Ok)  (46) 


which,  instead  of  using  the  CRB  as  the  projection  matrix,  uses  the  CCRB8.  It  is  important 
to  note  here  that  this  is  not  another  Quasi-Newton  method,  as  the  corresponding 
premultiplying  matrix  Dk  is  not  positive  definite,  but  rather  postive  semi-definite.  Indeed, 
for  U (0k)(UT (Ok)I(Ok)U (0k))~1UT (Ok)  to  be  positive  definite,  U(0k)  would  necessarily  be 
full  row  rank  which  requires  that  F(0k )  be  null,  i.e.,  an  unconstrained  model, 
corresponding  to  the  classical  method  of  scoring. 

This  constrained  scoring  formulation,  as  experienced  with  the  Newton  or  classical  scoring 
algorithm,  does  not  necessarily  lead  to  convergent  sequences.  To  better  control  this 
behavior,  an  additional  variational  parameter  ak  is  introduced  to  control  the  step  size.  The 
modification  is 

Ok+i  =  Ok  +  akB(6k)Ve  logp(a;;  0k)  (47) 

where  the  step  size  ak  satisfies  some  appropriate  step  size  rule  that  will  guarantee  usability 
and  will  stabilize  the  convergence.  Another  motivation  for  this  addition  is  the  lack  of 
knowledge  of  what  Lipschitz  condition  the  objective  satisfies.  If  we  did  have  that 
knowledge,  we  could  possibly  choose  a  fixed  step  size  ak  =  s  that  enables  the  iteration  to 
be  a  local  contraction  mapping.  Since  that  information  is  not  known,  we  instead  must 
employ  a  rule  with  a  variable  step  size  (22),  such  as: 

8 A  special  case  of  equation  46  for  linear  constraints  was  presented  in  (5).  And  a  specific  formulation  of 
equation  46  exists  for  the  conventional  optimization  problem,  again  with  linear  constraints,  in  (20,  p.  178). 
Since  that  problem  is  non-statistical,  in  place  of  the  FIM  is  the  Hessian  of  the  Lagrangian.  Our  problem  is 
statistical,  and  so  we  instead  estimate  the  Hessian  with  the  FIM. 
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1.  A  minimization  rule. 


2.  A  diminishing  step  size  rule. 

3.  A  successive  step-size  rule  (e.g.,  the  Armijo  rule). 

The  minimization  rule  chooses  the  optimal  ak  that  minimizes  the  corresponding  objective 
for  each  iterate.  This  rule  relies  on  a  one-line  search  for  each  iterate.  A  possible  variation  is 
to  restrict  ak  to  some  finite  interval.  The  diminishing  step  size  rule  chooses  a  sequence 
{ak}  with  the  restriction  that  ak  — >  0  while  Yl<k=\  ak  =  oo.  A  particular  step  size,  however, 
might  result  in  a  greater  negative  log-likelihood,  so  this  method  still  requires  a  check  to 
guarantee  usability.  If  a  particular  step  is  not  usable,  the  rule  might  be  adjusted  to  skip 
sufficient  elements  of  the  sequence  until  the  step  is  usable.  The  Armijo  rule  chooses  a 
sequence  defined  by 

ak  =  rks  (48) 

where  mk  is  the  first  nonnegative  integer  m  such  that 

log p(x;Gk+1)  -  log p{x,9k)  >  \\6k  -  Gk+ 1||2  (49) 

and  cr,  ft,  and  .s  are  fixed  scalars  with  0<cr<l,  0</3<l,  and  0  <  s.  If  6k  is  a  stationary 
point,  then  ak  is  set  to  0.  The  check  in  equation  49  guarantees  that  the  steps  of  each 
iterate  are  usable. 

4.2  The  Restoration  Step 

To  eliminate  the  rj  variable  in  equation  43,  we  used  the  fact  that  the  first  iterate  6 1  was  a 
point  in  the  constraint  set  ©.  However,  the  next  iterate  d2  is  not  guaranteed  to  satisfy  the 
constraints  since  it  is  the  solution  of  the  Taylor  approximations  in  equations  36  and  37. 

The  projection  matrix  is  constant,  so  between  iterates  the  search  (over  ak)  is 
one-dimensional  or  linear,  which  moves  away  from  any  nonlinear  constraint.  In  fact,  it  is 
only  guaranteed  to  satisfy  the  constraints  if  they  are  linear.  Thus,  while  the  sequence 
generated  by  equation  47  may  converge,  it  may  not  converge  to  a  point  in  the  constraint 
set  ©.  The  process  to  correct  this  error  is  called  the  restoration  step. 

We  restore  the  second  iterate  back  onto  the  constraint  set  ©  using  a  projection.  Since  ©  is 
convex,  the  projection  theorem  favors  using  the  natural  projection  for  this  step.  When  the 
projection  is  the  natural  one,  the  projection  theorem  says  it  is  uniquely  determined.  Let 
7T  :  — >  ©  be  the  natural  projection  of  MA-space  onto  ©.  Then  the  method  of  scoring 

with  constraints,  or  the  CSA,  is  given  by 

Qk+ 1  =  Gk{ak)  =  7T [0k  +  akB(6k)Vg  log p(x;  0k)\  (50) 
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where  «fc  satisfies  one  of  the  previously  listed  step  size  rules.  This  is  similar  to  a  two-metric 
projection  method  which  typically  has  improved  performance  over  simple  gradient 
projections  (22). 

It  is  desirable  that  the  projection  be  a  somewhat  simple  operation,  e.g.,  planar  or  spherical 
restrictions,  so  as  not  to  be  a  computational  burden.  However,  constraint  sets  may  arise 
where  the  projection  cannot  be  expressed  analytically.  For  these  scenarios,  we  present  the 
following  scheme  to  minimize  the  error  away  from  ©  to  a  predetermined  acceptable  level 
(20). 

Suppose  9^\  for  some  i,  is  a  regular  point  generated  by  the  RHS  of  equation  47  not  in  the 
constraint  set  ©.  Then  f(0[l))  ±  0.  Let  Gi  be  the  nearest  point  to  that  does  satisfy 
the  constraints,  so  that  m)  =  o.  Then  a  Taylor  approximation  of  f(0i)  =  0  about  0 j1* 
evaluated  at  6  is  given  by 

o  =  /(ei<1))  +  F(e<1))(e-e<I>). 

As  with  equation  37,  we  have  in  6  a  point  closer  to  0L  than  6^  is.  Solutions  of  9  are  of  the 
form 

g  =  gm  _  FT(e<1>)(F(e<1VT(»l1)))"1/(«.a>)  +  C 

where  £  G  Null(C/T(0^)).  Note  the  requirement  that  0[ 1  ’  be  regular  is  necessary  for  the 
inverse  of  F(0[l>)Fr  (0^)  to  exist.  The  restoration  update  is  then  given  by 

of+l)  =  -  FT(ef))(F(df))FT(ef)))-1f(ef))  +  c  (51) 

4.3  Implementation  of  the  Constrained  Scoring  Algorithm 

The  CSA  is  in  the  class  of  null-space  algorithms,  exploiting  U(6)  directly.  Instead  of 
employing  the  Hessian  or  the  FIM  as  done  in  Quasi-Newton  methods,  we  employ  the 
singular  CCRB  matrix  to  minimize  the  constrained  score  function.9  In  fact,  given  the 
unconstrained  FIM  1(0),  equation  46  or  50  provides  a  simple  algorithm  to  verify  CML 
performance.  Otherwise,  given  a  good  initialization,  equation  50  provides  a  method  to 
obtain  CML  performance.  This  can  be  considered  fine-tuning  the  estimate  that  provides 
the  initialization. 

If  the  quadratic  negative  log-likelihood  model  holds,  and  the  constraints  are  linear,  then 
when  initialized  with  a  feasible  point,  the  CSA  solves  the  CMLE  in  a  single  step.  This  is 
shown  in  detail  in  section  6.1. 

9The  CCRB  is  only  positive  semidefinite,  not  positive  definite,  since  the  null  space  matrix  U(0)  can  never 
be  full  row  rank  when  constraints  are  applied. 
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The  components  of  the  CSA  (see  figure  1)  include  the  restoration  steps  in  equation  51  (if 
the  projection  is  not  simple),  the  matrix  inverse  component  of  the  CCRB  in  equation  50, 
and  the  evaluation  of  an  expectation  (an  integration)  in  the  FIM.  To  reduce  the  complexity 
(see  table  1)  of  these  tasks  we  mention  some  variations  of  equation  50. 


Given  the  observations  x , 

1. 

Choose  an  initialization,  typically  Q\  =  ^(^mie), 
i.e.,  the  projection  of  the  MLE  onto  ©. 

2. 

Evaluate  the  decrement  A($i). 

(54) 

3. 

While  A (Qk)  >  e,  for  some  threshold  e, 
a.  Evaluate  the  gradient  Vglog p(x-,6k). 

b.  Evaluate  the  CCRB  B(6k). 

c.  Choose  ak  based  on  some  rule  so  that 

(11) 

Gk+ 1  =  7T [0k  +  akB(6k)X70logp(x ;  0k)\  is  usable. 

(50) 

d.  Reevaluate  the  decrement  A(0fc+1). 

(54) 

Figure  1.  The  constrained  scoring  algorithm. 


Table  1.  Complexity  of  the  CSA  per  iteration. 


Matrix 

Complexity 

U(G)  (given  F(0)) 
UT(6)J(G)U(G) 
(UT(0)J(0)U(O))-' 
U(G)(UT(0)J(G)U(G))~1UT(G) 
B(G)V elogp(x;  6) 

m 

<  AG 

N  J(N  +  J) 

J3 

NJ(N  +  J ) 

N 2 

N 

If  we  have  a  closed  form  expression  for  the  FIM  1(0),  this  eliminates  the  need  to  compute 
an  integral  for  each  step.  It  is,  however,  highly  unlikely  to  have  an  exact  formulation  for 
the  matrix  inverse  in  the  CCRB  except  in  trivial  or  simple  cases.  A  standard  procedure  in 
optimization  theory  for  gradient  methods  is  to  eliminate  the  need  to  compute  a  matrix 
inversion  for  every  step  by  reusing  the  initial  inverse  matrix.  This  adjustment  leads  to  the 
following  variant  of  the  CSA 

6k+l  =  Gk(ak)  =  7v[0k  +  akU(Gk)(UT(G1)I(6l)U(61))-1UT(Gk)Velogp(x;0k)].  (52) 

In  this  modified  CSA,  analogous  to  the  modified  Newton’s  method,  the  matrix  inversion 
needs  to  be  done  only  once,  for  the  initial  value.  This  procedure  also  requires  only  one 
evaluation  of  an  expectation  (or  integral)  to  obtain  the  initial  FIM.  Another  standard 
variation  is  to  apply  the  inversion  after  every  j  >  1  iterations. 
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If  the  projection  is  simple  or,  equivalently,  the  constraint  set  simply  defined,  e.g.,  planar, 
spherical,  or  boundaries,  then  the  restoration  step  is  easily  computed.  However,  if  the 
projection  is  not  easily  computed  and  an  iterative  scheme  as  in  equation  51  is  required,  the 
restoration  step  is  somewhat  tedious  and  may  be  time-consuming.  Alternatively,  the  CSA 
might  be  adjusted  by  only  restoring  occasionally.  The  procedure  would  then  be  to  obtain 
via  equation  46  the  ith  iterate  and  then  successively  restore  the  iterate  to  the  constraint  set 
via  equation  51,  and  then  to  repeat  the  procedure. 

Until  now,  we  have  also  always  assumed  that  the  projection  has  been  onto  a  convex  set. 
This  condition  guarantees  that  each  projection  in  the  restoration  step  is  unique;  this  is 
shown  via  the  Projection  Theorem  (18,22).  However,  note  that  this  condition  was  not  used 
in  section  3,  nor  was  convexity  used  in  the  development  of  the  projection  step.  In  either 
case,  only  connectivity  was  required.  This  is  important  to  point  out,  since  every  region 
defined  by  a  nonlinear  equality  constraint  is  nonconvex.  However,  these  nonlinear  equality 
constraints  are  the  ones  of  practical  interest.  So  if  we  simply  let  ©  be  a  connected  set,  we 
can  make  the  following  enhancement  in  the  definition  of  tt  [•]  in  equation  50.  Let 
7T  :  — >  ©  be  the  natural  projection  of  MA-space  onto  ©  with  the  minimal  distance.  For 

practical  sets,  e.g.,  a  sphere  in  lw,  the  projection  tt [•]  is  almost  everywhere  unique. 

Another  implementation  issue  is  when  to  stop  the  iterations.  An  obvious  choice  is  to 
measure  the  change  in  the  likelihood  after  each  iteration.  For  Newton’s  method,  another 
stopping  criteria  is  the  Newton  decrement  (18), 

\(0)  =  (Vg  logp(ay  0)(X7glogp(x]  d))^1'Vglogp(x;  0))1/2.  (53) 


This  is  a  norm  with  respect  to  the  Hessian  of  the  objective.  Note  A  (Ok)  decays  as  the 
iterations  converge  to  a  stationary  point.  The  corresponding  decrement  for  constrained 
scoring  is 

A (0)  =  (V£  log p(x-  6)B(0)Ve  log p(x;  0))1'2.  (54) 

This  scoring  decrement  is  shown  to  be  0  for  stationary  points  in  section  5.  By  a  continuity 
argument,  it  follows  that  the  iterations  are  close  to  the  stationary  point  when  A  (Of.)  is 
sufficiently  small. 

In  the  next  section,  we  show  that  the  CSA  at  least  converges  to  a  local  maximum  of  the 
likelihood.  The  only  requirement  is  that  UT  (0)1  (0)U  (0)  be  positive  definite.  However,  this 
is  also  a  requirement  for  the  existence  of  the  CCRB  (5). 


19 


5.  Convergence  Properties 

In  this  section,  we  examine  convergence  properties  of  the  constrained  scoring  algorithm 
motivated  by  the  properties  for  projected  gradient  presented  in  (23).  The  obvious 
statement  regarding  convergence  is  that  the  sequence  of  iterates  {Ok}  has  the  CMLE  0(x) 
as  its  limit  provided  the  initial  guess  is  sufficiently  close.  For  convenience,  we  will  choose  to 
analyze  only  the  convergence  properties  of  the  CSA  with  the  Armijo  step  size  rule, 
although  it  is  not  too  difficult  to  modify  these  proofs  for  another  rule.  By  construction,  we 
shall  show  the  algorithm  is  one  that  converges  to  a  local  stationary  point  if  there  is  indeed 
a  local  minimum. 

Let  {Ok}  be  any  sequence  generated  by  the  constrained  scoring  algorithm.  Thus,  0\  is  an 
arbitrarily  chosen  feasible  point,  and  successive  iterates  are  determined  by  the  CSA  in 
equation  50.  (Of  course,  0\  would  not  be  chosen  so  randomly,  but  we  ignore  this  aspect  of 
the  convergence  for  the  moment.)  Define  =  {0  G  ©| p(x]  0)  >  p(x ;  0fc)}.10 

Now,  by  definition  of  the  chosen  Armijo  rule  equation  49,  it  can  be  seen  that 

log p(x]0k)  <  \ogp(x;Ok+1) 


for  every  positive  integer  k.  Thus,  we  have  the  following  fact  about  {0k}: 

Property  1.  The  sequence  {—  logp(cc;  0k)}  is  a  monotone  decreasing  sequence. 
Furthermore,  if  —  \ogp(x\  0)  is  bounded  below  then  {—log p(x;Ok)}  converges. 

Thus,  given  an  iterate  0kl  all  succesive  iterates  are  contained  in  Xefc,  i.e.,  Oj  G  for  all 
j  >  k.  The  second  statement  is  simply  the  monotone  convergence  principle  from  analysis. 
And,  by  the  monotonicity  of  —  log(-),  then  {p(x\  0k )}  is  also  a  convergent  sequence  when 
{—  \ogp(x;  0k)}  is.  We  will  assume  that  the  likelihood  is  bounded  above,  so  that  this 
sequence  always  converges.  This  leads  to  the  following  result: 

Property  2.  The  sequence  {||#a,-+i  —  0k ||}  vanishes  as  k  — >  oo. 

Proof.  Again  from  equation  49  we  have  that  ||0fc+1  —  0k\\2  is  bounded  by  the  product  of  an 
iterate  from  a  sequence  bounded  below  and  logp(x,  0k+\ )  —  log p(x\  0k).  Since 
{logp(:r;  0k)}  converges,  the  bound  is  made  arbitrarily  small  for  sufficiently  large  k.  Thus, 
the  same  is  true  for  \\0k+i  —  0k ||.  j  [ 

10In  this  section,  we  will  assume  0  is  convex  so  that  7r [■]  is  the  natural  projection. 
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This  does  not  imply  that  the  sequence  {0k}  converges.  However,  if  is  bounded,  then 
the  Bolzano- Weierstrass  theorem  (19)  implies  the  existence  of  a  cluster,  or  accumulation, 
point,  i.e.,  the  existence  of  a  subsequence  that  does  converge. 

Property  3.  If  is  compact,  then  cluster  points  of  the  sequence  {0k}  are  also 
stationary  points. 

Recall,  a  point  0  is  stationary  if  it  satisfies  the  KKT  condition  equations  20  through  24. 
The  initial  iterate  is  feasible  and  the  restoration  tt  [•]  preserves  feasibility.  Since  we’ve 
incorporated  the  active  constraints  into  f,  the  only  additional  condition  is  equation  25.  So 
if  0*  is  a  stationary  point,  then 

Velog  p(x;0*)U(O*)  =  n*T  F  (0*)U  (0* )  =0. 

Hence,  B(0) X7e  logp(x;  0*)  =  U (0)(UT (0)I(0)U (0)ylUT (0) Ve  logp(:z;  0*)  =  0.  Thus,  a 
point  is  stationary  only  if 

0* (a)  =  7 t[0*  +  aB(0)Ve  log p(x\ 0*)}  =  7 t(0*)  =  0*. 

By  definition  of  the  Armijo  rule,  we  also  have  a  =  0  at  stationary  points.  While  this  step  is 
necessary  in  the  method  of  gradient  projection,  it  is  not  so  here  as  can  be  seen  above,  since 
the  end  result  would  hold  regardless. 

Proof  of  Proposition  3.  Let  0*  be  a  cluster  point  of  the  sequence  0k-  Then  there  exists  a 
convergent  subsequence  0rlk  which  has  0*  as  its  limit.  Note,  by  the  triangle  inequality, 

\\tt[0*  +  afU(0*)(UT(0*)I(0*)U(0*))-lUT(0*)Ve\ogp(x-0*))  -0*|| 

<  \\n[0* +a*B(0*)\7elogp(x;0*)\  -  0*\\ 

<  ||t t[0*  +  a* B(0*)\7 e  log p(x\  0*)}  -  n [0nk  +  ankB(0nk)We  log p(x;  0nk)} 

+7T [0nk  +  OinkB(0nk)V o  log p(x\  0n J{  -  0nk  +  0„fc  -  0*|| 

<  \\ir[0*  +  a* B(0*)V elogp(x\  0*)\  -  n[0rik  +  ankB(0nk)V e  logp(cc;  0nk)\  || 

+  I \n[0nk  +  ankB(0nk)V elogp(x ;  0nk)\  -  0nk\\  +  || 0nk  -  0*\\  .  (55) 


Since  the  projection  7r  is  onto  a  convex  set  ©,  the  distance  between  two  points  in  lw  is  at 
least  as  great  as  the  distance  between  the  projections  of  those  two  points.  Thus,  we  have 

||t t[0*  +  a*  B(0*)\7  elogp(x\  0*)\  -  n[0nk  +  ankB(0nk)Ve  logp(x;  0nk)\  || 

<  \\0*  +  a*B(0*)Velogp(x\0*)  -  0nk  -  ankB(Onk)Velogp(x;0nk)\\  (56) 
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Thus,  applying  the  inequality  of  equation  56  to  55  we  have 


\\n[G*  +a*U(9*){UT(6*)I(0*)U{0*))-1UT{0*)\7elogp{x;0*)\  -0*|| 

<  ||0*  +  a* B(G*)W e  log p(x;G*)  -  G„k  +  ankB(0nk)Ve  logp(cc;  0nJ|| 

+  hl0nk  +  ankB(Gnk)Ve  log p(x;  0„J]  -  0nJ  -  ||0«fc  -  0*|| 

<  ||0*  «-0nfc||  +  \\a*B(6*)V 0logp(x;  G*)  -  ank B(Gnk)Ve logp(cc;  0„J|| 

+  ll7r(0nfc  +  a„fcB(0njVelogp(a;;0nJ]  -  0nJ  +  ||0«fc  -0*11  (57) 

—  2  ||0nfc  —  0  ||  +  ||0nfc+l  —  0nfc|| 

+  \\a* B(G*)X7 0 log p(x]  G*)  -  ankB(Gnk)Vg\ogp(x]  Gnk)\\  (58) 

where  the  triangle  inequality  is  applied  to  obtain  equation  57,  and  the  CSA  iteration  in 
equation  50  is  used  to  obtain  equation  58.  Taking  the  second  term  in  equation  58,  and 
applying  the  triangle  inequality  yet  again,  the  last  term  of  equation  58  is  bounded  as 

|| a*B(G*)X7e  logp(:r;  G*)  -  ankB(Gnk)V0  logp(x;  0nJ|| 

<  ||a*B(0*)V*  logp(*;  G*)  -  a*B(G* ) V0  logp{x;  0nJ 

+  Q!*B(0*)Velogp(*;0nik)  -  ank B(Gnk)V  e\ogp(x;  0nJ|| 

<  \\a*B{G*)X7o\ogp(x-,G*)-a*B{G*)Velogp{x-,Gnk)\\ 

+  \\a* B(G*)V 0logp(x;  Gnk)  -  ankB{Gnk)V  0\ogp(x]  0nJ|| 

<  ||a*B(0*)(Ve  logp(*;  G*)  -  V0  logp^;  Gn J)  || 

+  || (a*B(G*)  -  ankB{Gnk))V e\ogp{x]Gnk)\\  .  (59) 

In  Euclidean  space  (or  any  norrned  space),  we  have  the  property  that  ||iVfu||  <  \\M\\  ■  ||u|| 
for  any  matrix  M  and  vector  v  ( 24  )•  Thus  equation  59  becomes 

\\a*B(G*)V0  log p(x-,  G*)  -  ankB(Onk)V0  logp(x;  0nJ|| 

<  I|q!*B(0*)||  ||V0logp(:r;0*)  -  V0logp(®;0nJ)|| 

+  || (a*B(G*)  -  ankB(Gnk))\\  ||V0logp(:r;  0nJ|| .  (60) 

Substituting  equation  60  into  58,  we  have 

\\ir[G*  +  a*U(G*)(UT (G*)I(G*)U(G*))~lUT (G*)V e\ogp(x-G*)\  -0*|| 

<  2  ||0nfc  -  0*||  +  ||a*B(0*)||  ||Velogp(!E;0*)  -  Velogp(*;0nfc))|| 

+  \\(a*B{G*)  -  ankB{Gnk))\\  ||V0logp(*;  0nJ||  +  ||0„fc+i  -  0„J|(61) 

Since  £01  is  compact,  then  ||0nJ|  and  ||  V0  logp(cc;  0nfe)||  are  bounded.  Note  the  inequality 
holds  for  all  nfc,  so  let  — >  oo.  Then  we  have  that  Gnk  — >  G*,  and  by  continuity, 
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Vfllog p(x]0nk)  — >  Vglog p(x]G*)  and  ankB(6rik )  — >  a*B(G*).  Now  the  first  term  of 
equation  61  vanishes  by  Property  2.  Thus,  the  right  side  of  equation  61  can  be  made 
arbitrarily  small  and,  therefore, 

7 r[0*  +  a*U(G*)(UT(G*)I(G*)U(G*)ylUT(G*)Ve  log p(x;  G*)\  =  G*, 

i.e. ,  G*  is  stationary.  The  statement  holds  true  as  well  if  is  compact  for  some  positive 
integer  n.  □ 

Property  4.  If  Tlg1  is  compact  for  all  sequences  in  a  set  ©  and  there  is  a  unique  cluster 
point  G0  for  all  such  sequences  then  hmfc_,oc  Gk  =  G0  for  every  sequence  {Gk}.  Also,  G0  is 
the  minimum  of  —  logp(:r;  G). 

Essentially,  if  only  one  accumulation  point  exists  for  all  such  sequences,  it  must  be  the 
CMLE,  i.e.,  lim^oo  6k  =  G(x). 


6.  The  Linear  Model  with  Constraints 

Suppose  we  have  a  vector  of  observations  x  of  a  parameter  vector  G  given  by  the  following 
linear  model 

x  =  JIG  +  n  (62) 

where  7i  is  a  known  observation  matrix  and  n  is  random  noise  from  J\f(0,C)  with  known 
covariance  C.  The  true  parameter  vector  will  be  Ga.  To  be  general,  we  will  assume  all 
parameters  are  complex- valued.  The  MLE  of  this  problem  is  well  known  (f,  pp.  528),  and 
also  happens  to  be  the  best  linear  unbiased  estimator  (BLUE)  and  the  minimum  variance 
unbiased  (MVU)  estimator: 

#(x)  =  {^e^ny^e^x.  (63) 

This  MLE  is  also  efficient,  i.e.,  it  has  a  covariance  matrix  equal  to  the  complex  CRB 
Cg  =  Ego(G  -  G0)(G  -  G0)h  =  (7iH C^Tiy1  =  T~\G0). 

where  1(G)  is  the  complex  Fisher  information  matrix  (f,  p.  529)  of  the  model.  Note  here 
that  the  log-likelihood  is  quadratic  with  respect  to  G ;  thus  the  Fisher  score  is11 

v,#  \ogp(x-,  g)  =  nTc~*(x*  -n*G*) 

11  There  are  numerous  definitions  of  the  complex  derivative.  We  define  it  to  be  ^  =  |(a^Ca^  ~  3  ^ 

for  complex-valued  a.  A  benefit  of  this  definition  is  that  numerous  results  are  preserved  for  strictly  real¬ 
valued  parameters. 
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and  the  Hessian  is 


V|logj o(x-d)  =  ~nHc  lU  =  -Z(tf). 


Hence,  the  FIM  and  CRB  are  constant  in  Thus,  for  convenience  in  this  section,  we  will 
simply  denote  the  FIM  as  X. 


The  complex  model  can  be  described  in  terms  of  real-valued  parameters  as  well.  Define 

T 

Q  =  [Re($)T,  Im(i9)T]  ,  then  the  real-valued  FIM  is  given  by 


1(d)  =  I 


'Re(Z)  — Im(Z)' 

_  o 

'Re^C"1?*) 

-Im  ('HhC~1H) 

Irri(Z)  Re(Z) 

—  z 

Im (UHC  xn) 

R e(nHC-lU)  _ 

In  which  case,  the  Fisher  score  is  now 


V0log  p(x\6) 


2 


Re  (HIIC 
Im  (■ nHc 


1 

1 


(x  —  'Hd)) 
(x  —  'H&)) 


The  Hessian,  likewise,  is  still  the  negative  FIM.  The  necessity  of  converting  the  FIM  and 
score  to  the  real  parameter  case  is  so  we  may  apply  the  CSA  in  the  following. 

6.1  Linear  Constraints 

Now,  suppose  that  linear  constraints  are  imposed  on  the  parameters,  i.e., 

©  =  {d  :  f(0)  =  X$  +  v  =  0}  (64) 


where  X  and  v  are  known.  We  assume  that  ©  is  nonempty.  Then  X~(d)  =  T  for  all  $.  If 
we  assume  that  ©  is  regular,  i.e.,  T  is  full  row  rank,  and  that  H  is  full  column  rank,  then 
the  CMLE  is  given  by  the  constrained  least  squares  estimator  (CLSE)  (4,  p ■  252) 

^clse(x)  =  V(x)-X-lXH  (XX-^ny1  (X#{x)  +  v)  (65) 

=  (x~l  -  X~xTh  (XX-^ny1  JFZ'1)  Xd(x) 

-X~lXH  (TX~xTh)  V  (66) 


The  Erst  term  of  this  last  equation  uses  the  Marzetta  derivation  of  the  CCRB  for  the 
expression  in  the  parenthesis,  albeit  in  complex  form.  Given  this  knowledge,  it  can  be 
shown  that  this  estimator  is  unbiased  and  efficient,  a  result  that  appears  to  be 
undocumented  in  the  literature,  even  for  the  real  parameter  case.12  The  second  term  in 

12For  completeness,  a  proof  of  the  unbiasedness  and  efficiency  of  equation  65  is  presented  in  appendix  C. 


24 


equation  66  is  a  specific  feasible  point  of  the  constraint  set.  However,  this  formulation 
requires  both  a  nonsingular  complex  FIM  X  (or,  equivalently,  a  full  column  rank  7i)  and  a 
full  row  rank  T ,  conditions  which  may  not  hold.  Using  the  CSA  to  circumvent  these 
conditions  essentially  turns  the  problem  into  a  standard  nullspace  quadratic  exercise  (13). 


First,  we  need  to  convert  the  problem  to  real-valued  parameters.  Note  that  equation  64  is 
equivalent  to  the  following  real-parameter  constraint  set 


© 


:  f  (9)  =  F9  +  v  =  0,  where  F 


Re(JF) 

lui(J-) 


— Im  (F) 
Re(jF) 


Re(r/)  1 
lm(v)  j 


(67) 


Let  U(G)  =  U  be  the  matrix  defined  by  equation  5.  It  can  be  shown  that  if  14(9)  =  14  is 
defined  to  be  the  matrix  whose  columns  form  an  orthonormal  basis  of  F  so  that  F14  =  0 
and  14HU.  =  Ijxj,  then 


U 


R  e(U)  -Im  (U) 
Im  (U)  R e(U) 


(68) 


Let  9 1  be  any  vector  satisfying  the  constraints  so  that  9 1  G  ©.  For  example,  if  the  space  is 
still  regular  (F  is  full  row  rank),  then  9\  =  —Fn(FFH)~1v  is  such  a  point  vector, 
otherwise  let  9\  =  —F^v\  however,  note  that  any  feasible  point  vector  will  work.  Note, 

Q\  =  [Re('j91)T,  Im('j91)T]T  is  the  corresponding  point  in  ©  .  As  stated  earlier,  since  the 
log-likelihood  is  quadratic  and  the  constraints  linear,  the  equations  40  and  41  are  exact,  so 
the  restoration  step  and  the  variable  step  size  in  the  CSA  equation  50  are  unnecessary,  i.e., 
the  CMLE  is  found  in  one  step  to  be  the  linear  estimator  given  by 


o(x)  =  e1  +  B(d1)ve\ogp(x-1e1) 

=  6\  +  U(UT  1(9 i)U)~1UT'V g  log p(x;  6i) 


Re^r)' 

1 

"R  e(BUHC~l(x-l-C9l)) 

Irn  (R  ( ) 

1 

lm(BUHC~1(x-'H9i))_ 

=>9(x)  =  91+BHhC-1(x-H91) 


where  £  =  13(9)  =  U(UTX(9)U)-lUT  is  defined  similarly  as  in  equation  11  to  be  the 
complex  CCRB  for  this  constrained  problem.13  Comparing  this  CMLE  result  with  the  prior 
CLSE  result  in  equation  65,  we  see  the  only  requirement  now  is  that  7il4  be  full  column 
rank,  whereas  equation  65  required  both  7i  to  be  full  column  rank  and  T  to  be  full  row 
rank.  In  other  words,  the  prior  solution  requires  information-regularity  and  a  regular  point 
solution  in  the  original  problem;  the  solution  given  here  only  requires  the  alternative 
information- regularity  condition  (see  Theorem  1  or  (<?)).  This  leads  to  the  following  result. 

13It  should  be  emphasized  that  the  complex  CCRB  is  of  the  form  given  by  15  only  for  this  particular  choice 
of  constraint.  For  general  constraints,  the  covariance  matrix  of  the  real-parameter  estimator  may  not  assume 
the  necessary  form  (^,  pp.  524-532). 
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Theorem  3.  If  the  observations  x  are  described  by  a  linear  model  of  the  form 

x  =  Till  +  n 

where  71  is  a  known  matrix,  D  is  an  unknown  parameter  subject  to  the  linear  constraint 
/($)  =  Tfi  +  v  =  0  with  the  true  parameter  being  i90,  and  n  is  a  noise  vector  with  PDF 
J\f(Q,C),  then  provided  7IU  is  full  column  rank  where  U  is  given  by  equation  5,  the  CMLE 
of  I}0  is  given  by 

■&{x)  =  tf1  +  B7iHC-1{x-7itf1)  (69) 

where  B  is  the  CCRB  and  is  any  arbitrary  feasible  point  (e.g.,  =  T^v).  4}(x)  is 

unbiased  and  is  an  efficient  estimator  which  attains  the  CCRB  and,  therefore,  is  the  BLUE 
and  the  MVU  estimator.  Furthermore,  when  71  is  full  column  rank  and  T  is  full  row  rank, 
then  &(x)  =  RClse(x). 

Proof.  First  note  that  since  TB  =  0,  the  estimator  satisfies  the  constraints: 

ffQlxf)  =T{V1  +  B7iHC~1(x  -  W))  +u  =  J=D\  +  v  =  ffdf)  =  0. 

Next,  from  equation  41  R0  —  Di  =  Ur]  for  some  rj  E  CJ  since  by  a  Taylor  expansion 
0  =  f(&o)  =  F-(0o-0i).  Hence,  we  have  the  following  convenient  property: 

B'7LTC-1H{'d0-R1)  =  U(UH7iHC-l7iUy1UH7IHC-17iUr] 

=  Ur] 

=  Ro-'&i- 

Thus,  the  CMLE  is  also  unbiased: 

E#J(x)  =  ^l  +  B7iHC-1E0o(x-7i^1) 

=  R1  +  B7iHC-1H(^0-R1) 

=  tf1+'d0-'d1 
= 
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and  the  estimator  is  efficient,  i.e.,  its  covariance  matrix  is  the  CCRB: 

~  EoAx))(0(x)  -  E^{x))H 
=  E^(x)^H(x)-{)0^ 

=  £*o[(0 1  +  BUhC-\x  -  W))(i?i  +  BHhC-\x  -  n^))H]  - 
=  Eo0[{#i  +  BnHC-\n  +  7i(V0  -  <?i))(^i  +  BHHC~\n  +  7*(0O  -  ^))H]  - 
=  +  <Q^0  -  0 x)hUhC-xUB  +  BUHC-xE^0nnHC-xUB 

+bhhc~1u{v0  -  0i)0f  +  buhc~xu(^0  -  <?i  )(i?0  -  ^x)HnHc 1tlb  - 

=  -  tf,)"  +  (0O  -  +  (0O  -  t?i)(t?0  -  0i)*  - 

+BHHC~1CC-1?tB 
=  BrLHc~lrcB 

=  U(1Ah'HhC-x'HU)~1Uh  =  U(UHXU)~lUH  =  B. 

So  this  more  general  CMLE  is  also  the  BLUE  as  well  as  the  MVU  estimator  in  the  general 
linear  model  under  the  linear  constraint.  To  see  equivalence  to  the  CLSE  expression 
equation  65,  assume  that  the  FIM  X  is  nonsingular  and  the  constraint  space  ©  is  regular. 
Then,  B  =  X  1  —  X~l T11  {TX~x T11)1  TX~x ,  i.e.,  the  Stoica  CCRB  formulation  and  the 
Marzetta  formulation  are  identical  (5),  and  so 

0(a?)  =  ^1  +  BUHC-1(x-n^1) 

=  BUhC  lx  -  ( buhc-xu  -  INxN)tf  i 
=  BXV{x)  -  (BX  -  INxN)Vi 

=  BXfi(x)  +X-lTH(TX~lTH)-1Tfil 

=  BX${x)-X-xTh{TX-xTh)-xv 

=  (X”1  -  x~1xH[xx~1xH)-lxx~1)X'd(x)  -  x-' xH {xx-] xHy} v 

=  0(a?)  -  X~x  TH  (TX~x  TH)~X  (T  ^(x)  +  v) 

=  0 clse(x ). 


□ 


In  addition  to  being  an  alternative  CMLE,  equation  69  is  also  an  alternative  CLSE. 

6.2  Nonlinear  Constraints 

As  an  example  of  nonlinear  parametric  constraints  imposed  on  equation  62,  we  consider  the 
constraint  set  ©  =  {0  :  =  |$j|'  —  1  =  0,  i  =  1,  2, . . . ,  N}  where  all  the  elements  of  $ 

are  restricted  to  be  of  unit  modulus.  This  was  one  of  the  constraints  applied  in  (12)  in  the 
evaluation  of  performance  bounds  of  a  different  model.  It  should  be  noted  that  this  set  is 
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not  convex,  but  natural  projections  from  M2JV  onto  ©  are  a.e.  unique,  i.e. ,  unique  except  on 
a  set  {0  :  =  0  for  any  i\  of  measure  zero.  The  gradient  matrix  of  the  constraints  is  then 

F(0)  =  2  [Re(T(0))  Im(T(t?))] 


where  T(i9)  =  diag($),  i.e.,  a  diagonal  matrix  with  ith  row-column  element  d,: •  A  matrix 
whose  columns  form  an  orthonormal  null  space  of  F(0)  was  found  in  (12)  to  be 


U(6) 


Im(T(d)) 
-R  e(T(d)) 


This  results  in  the  following  CCRB: 


B(G)  =  U  (Q)(U  T(Q)I  (Q)U  (0))~lUT  (0) 


Im(T(d))  ' 

— R  eOT 


•  (R e(TH(G)7iHC-ll-LT(G)))  1  ■  [im (T(0))  -Re(T(0))]  . 


Thus  the  CSA  is  given  by 


@k+i  —  tv  [0k  +  akB(0k)X7 e  log p(x-,  0k) 

Im(T(0fc)) 


=  7T 


+  Oik 


X 


— Re(T(dfc)) 

(Re(T//(dfc)?iflC"1?iT(dfc)))"i  ■  Im  (TH(#k)nHC~\x  -  ?t#k)) 


where  tv  is  the  projection  onto  ©  and  the  relation  6k 


[Re(dfc)T,  Im ('&k)T  T  holds  for  all  k. 


For  simulation,  we  randomly  selected  the  complex  elements  for  an  8x8  observation  matrix 
7i  from  the  standard  normally  distributed  number  generator  provided  in  MATLAB,  and 
randomly  generated  unit  modulus  elements  for  the  constrained  0  vector.  We  consider  the 
average  performance  of  the  C ~ 1  -norm  of  (x  —  Till)  and  average  performance  of  the 
mean-square  error  (MSE)  of  0  over  n  =  5000  realizations  of  the  noise  n  modeled  as 
spatially  white;  so  C  =  cr2I8xs  with  a2  =  A.  in  the  CSA  we  employ  the  diminishing  step 
size  rule  with  ak  =  p  (All  steps  were  usable.)  For  comparison  we  temporarily  ignored  the 
stopping  criteria  by  choosing  an  infinitesimal  bound  requirement  on  the  decrement.  A 
reasonable,  close  initial  estimate  of  the  CMLE  is  the  projection  of  the  MLE  equation  63 
onto  ©,  i.e., 


$1  =  Tv[d(x)\  =  TV 


(■ nHc  ln )  1nHc^1x 


This  initial  value,  as  one  would  expect,  turns  out  in  general  not  to  be  the  CMLE.  However, 
the  estimate  is  often  (but  not  always)  a  very  good  initialization,  as  revealed  in  figure  2. 

The  norm  error,  given  by  ^  Xu=i  11**  —  vs.  the  iterate  is  nearly  identical  for  any 
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Figure  2.  The  average  norm  of  x  —  'H.’dk  at  iteration  k.  There  is  significant  gain  in 
the  first  iterate  compared  with  later  iterates,  as  expected  with  a  quadratically 
convergent  sequence. 

particular  realization.  This  metric  is  key,  since  seeking  to  minimize  the  negative 

log-likelihood  of  the  linear  model  equation  62  is  equivalent  to  minimizing  the  norm  of 

(. x  —  Ti'd).  So  we  see  that  the  CSA  does  indeed  maximize,  at  least  locally,  the 

log-likelihood.  A  random  local  search  could  not  provide  a  better  global  maximum,  but  does 

make  evident  the  need  for  a  sufficiently  close  initialization  to  the  CMLE  for  the  CSA  (or 

any  Newton-type  algorithm)  to  successfully  work  in  this  example.  As  is  evident,  since  the 

negative  log-likelihood  is  quadratic  and  the  constraints  nearly  linear  (in  a  neighborhood  of 

the  CMLE)  there  is  significant  gain  after  merely  the  first  one  or  two  iterations  as 

convergence  is  nearly  quadratic.  Figure  3  compares  the  average  MSE  (per  real  parameter 

,  2 

coefficient)  at  iteration  k  to  the  CCRB  given  by  A—  Xu=i  $ k(xi )  —  ;  as  can  be  seen, 

the  CMLE  for  this  scenario  achieves  the  CCRB.  The  numerical  simulations  also  show  this 
particular  CMLE  to  be  asymptotically  unbiased  in  figure  4.  We  calculated  the  average  of 
the  absolute  bias  of  the  parameters,  i.e., 

^16  \  n 

average  bias  =  —  &(xi)j  ~  0j  , 

j= 1  i=  1 

where  6(Xi)j  and  Qj  refer  to  the  jth  element  of  the  vector  (and  not  the  iteration  in  this 
case).  An  average  error  of  .005  in  both  axes  of  the  Cartesian  plane  roughly  corresponds  to 
an  average  of  less  than  a  half-degree  of  error  in  the  phases  of  each  element  of  $. 
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Average  mean-square  error  aner  neranon  k  (per  real  coerricieni; 


Figure  3.  Average 


Figure  4.  Average  bias  error  of  the  CSA. 


A  Nonlinear  Model  Example 


By  a  nonlinear  model,  we  mean  a  model  which  is  nonlinear  with  respect  to  the  parameter 
vector  $  or  6.  Unlike  in  the  previous  section,  there  is  no  inherent  guarantee  that 
maximizing  the  likelihood  also  minimizes  the  MSE  of  the  parameter  vector.  It  depends  on 
the  model.  However,  we  still  have  the  property  that  if  an  efficient  estimator  exists  for  the 
constrained  problem  then  it  must  be  the  CMLE.  Recall  the  condition  for  equality  with  the 
CCRB  in  Theorem  1: 

0(x)  -0  =  B(9)\7elogp(x-,G).  (70) 

This  implies  that 

0  =  0(x)  —  0(x)  =  B(0(x))V  e  logp(x;  0( x)).  (71) 

When  Ur (0(x))I(0(x))U(0(x))  is  positive  semidehnite,  this  implies  that 

Vglog p(x;0(x))  G  span  ( FT(6(x )),  which  satisfies  the  Lagrangian  equation  34.  Thus,  as 

with  the  ML  method,  the  CML  method  also  produces  an  efficient  estimator  if  it  exists. 

In  this  section  we  examine  a  scenario  given  in  (12). 

7.1  MIMO  Instantaneous  Mixing  Model 

Consider  the  multi-input,  multi-output  (MIMO)  instantaneous  mixing  model  where  xn  is  a 
vector  of  observations  of  a  linear  mixing  of  unknown  parameters  at  time  n  —  1, ...  ,N  given 
by  the  model 

xn  =  T-Csn  +  n  (72) 

where  7i  is  an  unknown  M  x  K  complex-valued  channel  matrix,  each  sn  is  a 
complex-valued  data  symbol  vector,  and  the  additive  noise  vector  n  is  spatially  and 
temporally  white  with  variance  a2.  We  define  a  vector  of  unknown  parameters  by 

'/iW 

si1) 

0=  ;  , 

/iw 

s(r) 

where  is  the  kth  column  of  7t  and  =  [si(fc), . .  . ,  Sjv(^)]-  Then  the  complex  FIM 
was  found  in  (25)  to  be 

T(#)  =  ^QhQ 
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where  Q  =  [Q{1\ . . . ,  Q(A)]  with  Q(k)  =  [ IMxm  <8>  ®  /Mxm]  •  As  detailed  in  (25), 

this  FIM  is  singular  due  to  the  multiplicative  ambiguity  inherent  in  the  model.  As  before, 
the  model  can  be  described  in  terms  of  the  real-valued  parameter  vector 
6  =  [Re('*9)i  Im($)T]  ,  which  has  a  real-valued  FIM  given  by 


m  =  2 


Re(J(0))  -Im(J(0))' 

4 

R  e(QHQ) 

-Im  (QHQ) 

Im  (J(0))  Re(J(0)) 

Im  (QHQ) 

R e(QHQ)  _ 

By  the  structure  of  this  matrix  nullity(J(0))  =  2  ■  nullity(«7($)),  so  information-regularity 
cannot  be  achieved  without  constraints.  Sadler,  et  ah,  applied  constant  modulus  and 
semi-blind  constraints  on  equation  72  to  obtain  a  locally  identifiable  model  (12).  The 
constraint  set  is 


©  =  {0  :  fk,n(0)  =  \sn(k) |2  -  1  =  0 ,Vk,n;ft(0)  =  st(k)  -  stj.  =  0,  Wk,  t  =  1, . . . ,  T) 

where  the  st„k  are  known.  For  this  constraint  set,  the  gradient  matrix  of  the  constraints 
was  found  in  (7)  to  be 

F(0)  =  [Re(^(0))  Im(^W)] 


where 


>(1)(s(1),ft«) 

and  T{k)(s^k\hik))  =  2 

/axT  0  0 

jJ-TxT 

£>(s«)  0 

with  I»(sW)  defined  as  in  section  6.2.  (Note  that  this  construction  of  the  gradient  matrix 
F(0)  is  not  regular  since  ©  contains  redundancy  in  restricting  some  signal  values  to  be 
unit  modulus  as  well  as  known  values.  ©  might  be  reformulated  so  the  points  are  regular, 
but  this  is  unnecessary.)  A  matrix  whose  columns  form  an  orthonormal  null  space  of  F(0) 
is  given  by 


U(6) 


Im  (W(0)) 
-Re(W(0)) 


where 


M(0) 


UW(8W,hW) 


U(K\s(K\h^) 


and 


U{k\s^\h^)  = 


0 

T>t(s^) 


-ImxM  JImxM 

0  0 
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where  T>t(s^)  is  the  matrix  'D(s^)  with  the  first  T  rows  removed.  This  results  in  the 
following  CCRB: 


B{6) 


U{6){U 1  (9)I(9)U(9))~  U  (9) 


a 


Im (14(9))  ' 
-Re(W(0)) 


•  (R  e(UH(9)QHQ,U(9)))  1  ■  [lm(W(0)) 


-Re(W(0))]  . 


Note  the  similarity  in  the  structure  of  the  CCRB  with  that  in  section  6.2.  This  is, 
generally,  the  form  of  the  CCRB  when  the  appropriate  14(9)  matrix  can  be  found.  Note 
that  here  14(9)  is  not  the  null  space  of  T( 9 ),  but  rather  these  matrices  are  convenient 
expressions  to  formulate  U(0)  and  F(0),  respectively. 


We  simulated  a  K  =  2  source,  M  =  2  channel  model  over  N  —  30  time  samples.  The 
channel  Ti  is  taken  from  the  three-ray  multipath  case  in  (12)  with  directions-of-arrival 
(DOAs)  {—1,0,4}  and  corresponding  complex  amplitudes  {\Ah2Z1}?,  v^F5,  V0.15ZZ/}  for 
source  1  and  DOAs  {0,  5, 11}  and  amplitudes  {a/0.15 Zz4,  \/o76,  \/0.25Z|}  for  source  2. 
The  source  elements  were  taken  randomly  from  an  8-PSK  alphabet.  The  constraints  are 
the  modulus  constraint  as  well  as  knowledge  of  the  first  T  =  2  symbols  for  each  source 
discussed  above.  (FIM-regularity  requires  at  least  T  =  1  training  samples  per  source.) 
Source  1  (i.e. ,  s^))  is  normalized  to  have  an  SNR  of  20  dB  and  the  SNR  of  source  2  is  set 
at  15  dB.  We  can  scale  the  channel  to  reflect  these  signal  powers,  i.e.,  for  i  —  1,  2 


SNR,; 


Ma2  ’ 


which  allows  us  to  set  the  noise  covariance  to  be  o2  =  1.  We  obtain  an  initialization  via  the 
zero-forcing  variant  of  the  analytical  constant  modulus  algorithm  (ZF-ACMA)  found  in 
(26).  This  estimate  is  projected  onto  the  constraint  set  ©  and  then  we  apply  the  CSA 
using  a  successive  step  size  scheme  (a*  =  for  the  smallest  positve  integer  m  that  results 
in  a  usable  step).  We  calculate  the  average  MSE  (per  real  coefficient)  at  each  iteration  over 
n  =  2000  trials  and  compare  with  the  mean  CRB  for  each  of  the  sources  and  channels. 

The  numerical  simulations  show  the  improvement  in  the  average  MSE  of  both  the  signals 
and  the  channels  (figure  5),  on  average  halving  the  MSE  of  the  initialization  estimates 
provided  by  ACMA  in  this  instance.  Note  that  after  a  certain  number  of  iterations,  the 
MSE  for  both  the  channel  and  source  elements  are  actually  slightly  increasing.  This  occurs 
when  the  decrement  A (9k)  is  sufficiently  close  to  0  (see  figure  6).  Setting  the  stopping 
criteria  to  be,  e.g.,  A (0k)  <  1,  achieves  the  least  MSE  from  the  CSA  iterations.  We  note 
that  generally  there  is  greater  improvement  for  the  channel  estimates  over  the  signal 
estimates.  Also,  we  note  that  the  CSA  does  not  appear  to  reduce  the  MSE  significantly  for 
the  T  =  1  case  or  for  low  SNR  values. 
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Figure  6.  The  average  decrement  at  each  iteration. 
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8.  Conclusions 


We  determined  the  asymptotic  normality  properties  of  a  parametrically  constrained 
maximum-likelihood  problem.  In  doing  so,  we  have  shown  that  this  CMLE  is  consistent 
and  asymptotically  efficient.  We  then  derived  a  generalization  of  the  method  of  scoring,  the 
Constrained  Scoring  Algorithm,  using  Langrangian  optimization  methods.  The  CSA 
maintains  certain  desirable  convergence  properties  and  proofs  of  these  have  been  offered. 
Finally,  we  examined  several  problems  of  interest:  the  classical  linear  model  and  an 
instantaneous  linear  mixing  model,  to  verify  the  usefulness  of  this  approach.  For  the  linear 
model,  we  found  another  CMLE  which  we  have  shown  to  be  unbiased  and  efficient. 
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A.  Equivalence  of  Optimality  Conditions 


This  appendix  shows  the  equivalence  between  Bertsekas’  criteria  of  optimality  locally  in  a 
convex  set  (22,27)  and  the  method  of  Lagrange  multipliers.  Bertsekas  defines  0*  G  ©  to  be 
optimal  in  the  convex  set  ©  provided 

V<9  logp(a;;  0*)(0*)(0  —  0*)  <  0  (A.l) 


for  all  0  6  0.  The  Lagrange  method  is  optimal  given  the  KKT  conditions  in  equations  20 
through  24. 

Assume  equation  A.l.  Then  either 


(a)  0*  G  int©  (the  interior  of  ©),  or 

(b)  0*  G  <9©  (the  boundary). 


First,  suppose  (a)  0*  G  int©.  Then  0*  is  in  an  open  set  O  C  ©.  Let  0  be  a  point 
sufficiently  close  to  0*  so  that  both  0  and  0  =  0*  —  (O'  —  6*)  are  in  O.  Note  that  0  is 
the  point  vector  exactly  oppposite  0  from  6*.  Since  6*  is  optimal,  then 

V0  logp(x;  0*)(0*)(0  -  0*)  <  0. 


for  0  =  0  and  0  =  0.  But  also 

VTe  \ogp(x-  0*)(0"  -  0*)  =  V£  log p(x-  0*)  ■  -(O'  -  0*)  >  0. 


So,  log p(x]  0*)(O  —  0*)  =  0  for  every  0  G  O  since  the  choice  of  0  was  arbitrary.  And 

since  O  is  open  the  dimension  of  {0  —  0*  :  0  G  O}  is  the  dimension  of  ©.  Thus,  we  must 
have  that  V©  logp(ay  0)  =  0.  Simply  choose  pi*  =  0  and  is*  =  0,  then  the  equation  in  (24) 
is  satisfied  by 

V0  logp(x-  0*)  +  pi*TF(0 *)  +  is*tG(0*)  =  0. 

Otherwise,  suppose  (b)  that  0*  G  d Q  and  suppose  V0log p(x\  0*)  ^  0.  Then  since  we  have 
f(0)  =  0,  g(0)  <  0  for  0  G  ©  and  f(0)  =  0,  g(0)  >  0  for  0  ^  ©,  and  F(0*),  — G(0 *)  must 
span  the  convex  set  locally  at  0*,  then,  in  particular,  there  exists  a,  pi,  is  with  is  >  0  such 
that 

piT F(0*)  -  istG(0 *)  =  — eV0  logp(£c;  0*) 
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for  some  small  e  >  0.  This  is  because  the  gradient  of  the  negative  log-likelihood  must  be  in 
the  direction  of  the  convex  set,  otherwise  6*  is  not  optimal.  Therefore,  /i*  =  —  e-1/i  and 
v*  =  e-1^  gives  the  desired  KKT  condition  equation  24.  (Note  this  requires  the  inequality 
constraint  -  for  the  strict  equality  constraint  the  set  is  not  necessarily  convex!) 


Conversely,  suppose  we  have  the  KKT  condition  equation  24.  If  the  gradient  of  the 
log-likelihood  is  zero  at  a  stationary  point,  then  that  stationary  point  is  in  int©  and  the 
Bertsekas  condition  is  trivial.  So  suppose  it  is  nonzero  and  thus  our  stationary  point  6*  is 
on  the  boundary  <9©.  Note  g(0*)  =  0  on  d@.  Recall  the  Lagrangian  in  equation  19.  Then 
note  that  for  any  6  in  ©  we  have  that 


L(0*,  //\  v*)  +  \7gL(0*,  fj.*,  v*)(0  -  0*)  +  o(  1) 

L(0*,  M*,  v*)  -  V£  logp(a;;  0*)(0  -  0*) 

+v*tf(6*)(6  -  e*)  +  u*tg(0*)(0  -  e*)  +  o(i) 
L(0* ,  M*,  logp(*;  0* )(0  -  0* ) 

+ST(f(0)  -  f(0*))  +  v*T(9(0)  -  9(0*))  +  0(1) 
L(0*,  »*,  v*)  ~  logp(*;  0*)(0  -  0*)  +  v*Tg{0)  +  o(l). 


Since,  6*  is  optimal  then  L(d,  fi* ,u*)  >  L(0* ,  g,*,u*),  provided  we  scale  u*  to  be 
sufficiently  small.  Thus,  —  log p(x;0*)(0  —  6*)  >  —  v*1  g{6)  T  o(l) .  Then  d  can  be  made 
arbitrarily  close  to  0*  to  force  o(l)  — >  0  and  v*  can  be  scaled  arbitrarily  small  and  we  have 

— V^log p(x;0*)(9-9*)  >  0. 
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B.  Taylor  Expansion  Derivation 

Given  the  general  iterative  form  equation  33,  we  derive  the  CSA  via  a  Taylor  expansion. 
Note  that  for  any  iterate  6k  satisfying  the  constraints,  the  space  spanned  by  the  gradient  of 
the  active  constraints  at  6k,  namely  F(8k),  are  the  directions  for  steepest  descent  and 
ascent  of  f .  Thus,  for  the  step  dk  to  be  feasible  (or  nearly  so),  it  is  necessary  that  dk  be 
orthogonal  (or  nearly  so)  to  this  gradient  matrix,  i.e.,  dk  €  span  U(8k).  Let  d'k  be  such 
that  dk  =  U(6k)d'k  where  U(6k )  is  defined  in  equation  5.  So,  the  iterative  form  is  now 

6k+ i  —  6k  +  oik  '  u(et)dl  (b.i) 

Now,  consider  the  Taylor  expansion  of  the  log-likelihood  about  6k  along  the  vector 
U(6k)d'k.  This  is  given  by 

log p(x]  6k  +  U (6k)d'k)  =  log p(x;  6k)  +  log p(x;  6k)U ( 6k)d'k 

+^d.JUT(0t)Vl  logj>(*;fli)n(flt)4  +  o(||4||) 

~  log p(x]  6k)  +  Vq  logp(;c;  6k)U ( 6k)d'k 

-l-^uT\ek)ngk)u(ekUk. 

Note  above  we  replace  the  negative  Hessian  with  the  Fisher  Information  as  well  as  drop  the 
higher  order  terms.  This  gives,  approximately,  a  quadratic  log-likelihood  model  restricted 
to  the  subspace  spanned  by  the  columns  of  U(6k),  i.e.,  the  space  which  locally  preserves 
the  active  constraints  f(6k)  =  0,  which  is  why  it  is  refered  to  as  the  null  space.  The 
maximum  of  this  quadratic  model 

*(4)  =  Vj  log  j,(*;  0k)U(0k)d't  -  i dkTUT(0k)I(0k)U(0k)cfk  (B.2) 

over  the  choice  of  d'k  must  be  the  stationary  point  of  the  approximation.  Hence, 
differentiating  equation  B.2  and  setting  the  result  to  0,  the  optimal  d'k  must  satisfy 

0  =  Vq  log p(x;  6k)U(6k )  -  UT(6k)I(6k)U(6k)d'k* . 

This  requires  more  than  just  the  non-singularity  of  the  UT(6k)I{6k)U(6k)  matrix,  since 
the  Hessian  of  the  log-likelihood  approximation  must  be  negative  definite  as  well  for  d'k  to 
maximize  this  function.  Indeed,  in  this  case,  the  Hessian  of  &{d'k)  is  -UT(6k)I(6k)U(6k), 
so  we  must  have  that  UT (6k)I(6k)U(6k)  is  positive  definite  as  well.  Solving  for  dk  and 
substituting  into  dk  in  equation  B.I  we  obtain  the  unrestored  version  of  the  CSA, 

6k+l  =  6k  +  akU (dk)(UT (6k)I(6k)U (6k))~1UT (6k)\7 glogp(x\  6k). 
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C.  Constrained  Least  Squares  Estimator 


We  will  show  that  the  CLSE  in  equation  65  is  both  unbiased  and  efficient  relative  to  the 
Marzetta  form  of  the  CCRB  under  the  assumption  that  x  ~  A f(H0o,  C).  We  only  show 
this  for  the  real  parameter  case,  although  the  complex  parameter  case  is  similar.  We  do  so 
without  the  nullspace  derivation  of  the  CLSE.  The  equality  shown  in  Theorem  3  is  an 
alternative  proof  of  this  fact.  Recall  that  the  MLE  0(x)  is  unbiased,  thus 

E0odCLSE(x )  =  (V1  -  rlFT  {FI~lFTyl  FI'1)  IEeo6(x)  -  I~lFT  {FI~lFTyl  v 

=  (V1  -  I^Ft  (FJ^F7)  _1  FI1 )  I0O  -  I~1Ft  (FI^Fy  ^  v 

=  e0  -  i~1ft  (fi-1ft)  ~l  ( fg0  +  v) 

=  eQ. 

So  the  CLSE  is  also  unbiased.  The  covariance  matrix  of  the  CLSE  is  given  by 
Cov0o(0clse(x )) 

=  EgodCLSE(x)GcLSE{x )  -  0o6j 

=  (l'1  -  I~1Ft  ( FI^Ft)~ 1  FI"1)  IEgo{0{x)OT{x))I  (l'1  -  I~1Ft  (FI^F7)-1  FI'1) 

-  (l”1  -  I~1Ft  (FI-lFTyl  FI 'x)  IE0o{6(x))vt  (FI^F7)'1  FI  1 
—I~1Ft  (FI^F7)-1  vEoo(6t(x))I  (l~l  -  IlFT  (FI~1FTy1  FI”1) 

+i~1ft  (Fi-^y-1  vvT  (FJ^F7)'1  FI1  -  e0e l 

=  (l'1  -  I  yFt  (F/“1Ft)~1  FI"1)  IEgo{6(x)0T{x))I  (l'1  -  I~lFT  (. FI~lFTyl  FI'1) 

-  (l'1  -  I-'F7'  (FI^F7)-1  FI"1)  IG0vT  (FI~1FTy1  FI1 
-I-lFT  (. FI~1FTy 1  vd^I  (l'1  -  I^F7  (FI'1FT)  1  FI'1) 

+I~1Fr  (FI^F7)-1  (FI^F7)-1  FI'1  -  0o0j 

=  (l'1  -  I^F7  (FI'1FT)_1  FI'1)  IEgo{6(x)0T{x))I  (V1  -  I'XFT  (FI^F7)-1  FI'1) 

+  ((l'1  -  IxFt  {FI~1Ft)~1  FI'1)  I6I0  -  I-xFt  (FI^F7)"1  vj  ■ 

((l'1  -  I~1Ft  (FI'1Fr)  1  FI'1)  I0O  -  I~lFT  (FI~1Ft)~1u)T 

-  (l'1  -  I-'F7'  (FI~lFTyl  FI'1)  I0odJl  (l'1  -  I':Ft  (FI^F7)"1  FI'1)  -  0o0j. 
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But  since  F60  +  v  =  0,  then 


( I _1  -  I~lFT  (FI~lFT)  1  FI~l^j  I0o  -  I~lFT  (. FI  lFT )  1  u 

=  ef>-  rlFT  (fi_1ft)  (F0O  +  v) 


So  this,  with  the  knowledge  that  the  MLE  6{x)  is  efficient  relative  to  the  CRB,  shows  that 
the  covariance  matrix  of  the  CLSE  is 

Covo0(Oclse{x)) 

=  (l~l  -  I~lFT  (FI~1FTy1  FI’1)  I  [E0o(0(s x)Gt{x))  -  0oQTo\ 
xl  (V1  -  rlFT  (yFI~1FT)1  FI-1) 

=  (V1  -  r1FT  (f/“1ft)  _1  Fi"1)  i  (V1  -  i1ft  (. fi1ft ) _1  f/-1) 

=  r1  -  /~1ft  (f/_1ft)  _1  FI1 


which  is  the  Marzetta  form  of  the  CCRB  (2).  Therefore  the  CLSE  is  both  unbiased  and 
efficient. 
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